# Atomic

## arch/x86/include/asm/barrier.h

```c
/*
 * Force strict CPU ordering.
 * And yes, this is required on UP too when we're talking
 * to devices.
 */

#ifdef CONFIG_X86_32
/*
 * Some non-Intel clones support out of order store. wmb() ceases to be a
 * nop for these.
 */
#define mb() alternative("lock; addl $0,0(%%esp)", "mfence", X86_FEATURE_XMM2)
#define rmb() alternative("lock; addl $0,0(%%esp)", "lfence", X86_FEATURE_XMM2)
#define wmb() alternative("lock; addl $0,0(%%esp)", "sfence", X86_FEATURE_XMM)
#else
#define mb() 	asm volatile("mfence":::"memory")
#define rmb()	asm volatile("lfence":::"memory")
#define wmb()	asm volatile("sfence" ::: "memory")
#endif
```

1. If the CPU does not support `mfence`, `lfence`, or `sfence` (32-bit CPUs without SSE/SSE2), the `alternative()` macro falls back to a `lock`-prefixed instruction: `lock; addl $0,0(%%esp)`. The `lock` prefix ensures that the CPU has exclusive ownership of the relevant cache line for the duration of the operation and also provides ordering guarantees, so a locked no-op add works as a full barrier. This may be achieved by asserting a bus lock, but the CPU avoids that where possible.

2. [Lock & memory order](https://stackoverflow.com/questions/60332591/why-is-lock-a-full-barrier-on-x86)
    
   [memory order consume](https://preshing.com/20140709/the-purpose-of-memory_order_consume-in-cpp11/)

3. [Cache coherence: MESI](https://cloud.tencent.com/developer/article/1548942)

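MESI optimizations and the problems they introduce

Cache-coherence messages take time to deliver, so state transitions incur latency. When one cache changes the state of a line, the other caches must receive the message, perform their own transitions, and send a response, and the CPU waits until all of those responses have arrived. Any stall along the way can cause performance and stability problems.

Avoiding stalls on state transitions: store buffers

For example, to modify data held in the local cache, the CPU must tell every other CPU cache that holds the same line to move it to the I (Invalid) state, and then wait for the acknowledgements. Waiting blocks the processor and lowers its performance, because the wait takes far longer than executing one instruction.

To avoid wasting that CPU time, store buffers were introduced. The processor writes the value it wants to store into the store buffer and carries on with other work. The data is committed only once all invalidate acknowledgements have been received.

Store buffers bring two risks.

First, the processor may try to read a value that is still in the store buffer and has not been committed yet. The solution is called store forwarding: a load returns the value from the store buffer if it is present there.

Second, there is no guarantee about when a store will complete.

**A write barrier, or store memory barrier (a.k.a. ST, SMB, smp_wmb), is an instruction that tells the processor to apply all stores already in the store buffer before executing the instructions that follow it.**

**A read barrier, or load memory barrier (a.k.a. LD, RMB, smp_rmb), is an instruction that tells the processor to apply all invalidations already in the invalidate queue before performing any load.**

**This is exactly why release and acquire have to come in pairs.**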
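The pairing of a write barrier on one side with a read barrier on the other can be sketched in user space with C11 fences. This is a minimal sketch, not kernel code: `publish`, `consume`, `payload`, and `ready` are hypothetical names, and the C11 fences only stand in for `smp_wmb()` and `smp_rmb()`.

```c
#include <assert.h>
#include <stdatomic.h>

static int payload;        /* plain data published by the writer */
static atomic_int ready;   /* flag telling the reader that payload is valid */

/* Writer: store the data, then the flag, with a release fence in between. */
void publish(int value)
{
    payload = value;
    atomic_thread_fence(memory_order_release);   /* stands in for smp_wmb() */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/* Reader: load the flag, then the data, with an acquire fence in between. */
int consume(void)
{
    while (!atomic_load_explicit(&ready, memory_order_relaxed))
        ;                                        /* spin until the flag is seen */
    atomic_thread_fence(memory_order_acquire);   /* stands in for smp_rmb() */
    return payload;
}
```

In real use `publish()` and `consume()` would run on different threads. Dropping either fence breaks the guarantee: the writer's fence orders the two stores, and the reader's fence orders the two loads.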

4. [MESI](https://en.wikipedia.org/wiki/MESI_protocol)

**Store Buffer**

A store buffer is used when writing to an invalid cache line. Since the write will proceed anyway, the CPU issues a read-invalid message (hence the cache line in question and all other CPUs' cache lines that store that memory address are invalidated) and then pushes the write into the store buffer, to be executed when the cache line finally arrives in the cache.

A direct consequence of the store buffer's existence is that when a CPU commits a write, that write is not immediately written in the cache. Therefore, whenever a CPU needs to read a cache line, it first has to scan its own store buffer for the existence of the same line, as there is a possibility that the same line was written by the same CPU before but hasn't yet been written in the cache (the preceding write is still waiting in the store buffer). Note that while a CPU can read its own previous writes in its store buffer, other CPUs cannot see those writes before they are flushed from the store buffer to the cache - a CPU cannot scan the store buffer of other CPUs.

**Invalidate Queues**

With regard to invalidation messages, CPUs implement invalidate queues, whereby incoming invalidate requests are instantly acknowledged but not in fact acted upon. Instead, invalidation messages simply enter an invalidation queue and their processing occurs as soon as possible (but not necessarily instantly). Consequently, a CPU can be oblivious to the fact that a cache line in its cache is actually invalid, as the invalidation queue contains invalidations that have been received but haven't yet been applied. Note that, unlike the store buffer, the CPU can't scan the invalidation queue, as that CPU and the invalidation queue are physically located on opposite sides of the cache.

As a result, memory barriers are required. A store barrier will flush the store buffer, ensuring all writes have been applied to that CPU's cache. A read barrier will flush the invalidation queue, thus ensuring that all writes by other CPUs become visible to the flushing CPU. Furthermore, memory management units do not scan the store buffer, causing similar problems. This effect is visible even in single threaded processors.


5. [Memory Consistency Models](https://www.cs.utexas.edu/~bornholt/post/memory-models.html)

6. Compiler Barriers

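Compiler optimizations can be constrained with a compiler barrier. The familiar `volatile` qualifier, for instance, makes the compiler generate code that re-reads the variable from memory on every access instead of reusing a value cached in a register. In a multithreaded environment, a variable that the current thread never modifies may still be modified by another thread, so only a read from memory is reliable.

This partly explains why `atomic_read()` and `atomic_set()` are built on the `READ_ONCE()` and `WRITE_ONCE()` macros: the variable they operate on may be modified by several cores or threads at once, so the compiler must not treat it as an ordinary variable and optimize it incorrectly. The other reason is that the two macros act as markers, reminding the programmer that the variable is shared between cores or threads and should be protected by a lock where necessary.

The Linux function that sets a compiler barrier is `barrier()`. Its GCC implementation (defined in include/linux/compiler-gcc.h) is: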

    /* The "volatile" is due to gcc bugs */
    #define barrier() __asm__ __volatile__("": : :"memory")

This is inline assembly containing an empty instruction. How can an empty instruction have any effect?

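It relies on the `"memory"` entry in the clobber list at the end. The clobber list is the interface through which the author of the inline assembly tells GCC which registers and memory the assembly modifies.

Here `"memory"` tells GCC that the assembly may have modified memory, so the C code before the statement and the C code after it see different memory contents. Memory accesses after it therefore cannot rely on register values loaded before it and have to re-read from memory.

Whether or not the compiler could see through this trick, it happily plays along. Note that `barrier()` only constrains the compiler; it emits no actual instruction and therefore has no effect on how the CPU executes the final code.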
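The effect of the `"memory"` clobber can be sketched with two reads of the same variable. This is only an illustration: `shared_flag` and `sum_two_reads` are hypothetical names, and the macro is copied from the definition quoted above.

```c
#include <assert.h>

#define barrier() __asm__ __volatile__("" ::: "memory")

static int shared_flag;   /* hypothetical variable another thread might change */

int sum_two_reads(void)
{
    int first = shared_flag;    /* may be kept in a register */
    barrier();                  /* compiler must forget cached memory values */
    int second = shared_flag;   /* reloaded from memory, not reused from `first` */
    return first + second;
}
```

Without `barrier()`, an optimizing compiler is free to load `shared_flag` once and return twice that value. With it, the generated code contains two loads, but still no fence instruction.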

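Linux also provides `smp_mb()`, which looks like a memory barrier made specifically for SMP systems. Does it add some extra ordering guarantee between CPUs on top of `mb()`? No. On an SMP build it is defined as `mb()` here, and on a UP build it degrades to a compiler barrier.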

----------------------------

```c
#ifdef CONFIG_SMP
#define smp_mb()	mb()
#else	
#define smp_mb()	barrier()
#endif
```

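`smp_mb()` is not, as many people assume, a superset of `mb()`; if anything it is a subset. Anywhere `smp_mb()` works, `mb()` can replace it, but `smp_mb()` cannot always replace `mb()`.

Even on a multi-core system `smp_mb()` is in general weaker than `mb()`: `smp_mb()` only orders accesses between CPU cores, while `mb()` also covers the other modules on the SoC, such as devices.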

1. [linux barrier](https://zhuanlan.zhihu.com/p/96001570)

2. [barrier and linux kernel](https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/memory-access-ordering-part-2---barriers-and-the-linux-kernel)

----------------------------

```c
/* Optimization barrier */
/* The "volatile" is due to gcc bugs */

#define barrier() __asm__ __volatile__("": : :"memory")
```

1. This is the definition of `barrier()`. It is a compiler barrier.

`volatile` keeps the compiler from deleting or moving the empty asm statement, and the `"memory"` clobber tells it that memory may have changed, so values cached in registers must be reloaded from memory. The compiler cannot move memory accesses from before the barrier to after it, or from after it to before it.

----------------------------------

```c
#ifdef CONFIG_X86_PPRO_FENCE
#define dma_rmb()	rmb()
#else
#define dma_rmb()	barrier()
#endif
#define dma_wmb()	barrier()

#ifdef CONFIG_SMP
#define smp_mb()	mb()
#define smp_rmb()	dma_rmb()
#define smp_wmb()	barrier()
#define smp_store_mb(var, value) do { (void)xchg(&var, value); } while (0)
#else /* !SMP */
#define smp_mb()	barrier()
#define smp_rmb()	barrier()
#define smp_wmb()	barrier()
#define smp_store_mb(var, value) do { WRITE_ONCE(var, value); barrier(); } while (0)
#endif /* SMP */
```

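1. `barrier()` is only a compiler barrier. On UP, a compiler barrier is enough.

2. On SMP, `smp_wmb()` needs only a compiler barrier. On x86, a single CPU never reorders a load followed by a load (L-L), a store followed by a store (S-S), or a load followed by a store (L-S); only a store followed by a load (S-L) can be reordered. This memory model is called TSO (Total Store Order) and is commonly described as strongly ordered.

In other words, the CPU never moves a write ahead of another write, so only the compiler has to be stopped from reordering them.

3. The mandatory barriers `mb()` and `rmb()` use `mfence` and `lfence` to constrain the CPU. Of the SMP variants above, only `smp_mb()` still emits a real fence instruction; `smp_rmb()` is a compiler barrier unless `CONFIG_X86_PPRO_FENCE` is set.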

-------------------------

```c
#if defined(CONFIG_X86_PPRO_FENCE)

/*
 * For this option x86 doesn't have a strong TSO memory
 * model and we should fall back to full barriers.
 */

#define smp_store_release(p, v)						\
do {									\
	compiletime_assert_atomic_type(*p);				\
	smp_mb();							\
	WRITE_ONCE(*p, v);						\
} while (0)

#define smp_load_acquire(p)						\
({									\
	typeof(*p) ___p1 = READ_ONCE(*p);				\
	compiletime_assert_atomic_type(*p);				\
	smp_mb();							\
	___p1;								\
})

#else /* regular x86 TSO memory ordering */

#define smp_store_release(p, v)						\
do {									\
	compiletime_assert_atomic_type(*p);				\
	barrier();							\
	WRITE_ONCE(*p, v);						\
} while (0)

#define smp_load_acquire(p)						\
({									\
	typeof(*p) ___p1 = READ_ONCE(*p);				\
	compiletime_assert_atomic_type(*p);				\
	barrier();							\
	___p1;								\
})

#endif

```

1. [acquire-release](https://preshing.com/20120913/acquire-and-release-semantics/)

![acquire-release](resources/02.png)

2. For `release`, the barrier comes before the write; for `acquire`, the barrier comes after the read.

3. Fence instructions and the orderings they enforce:

* `lfence` = L-L + L-S

**LFENCE** Performs a serializing operation on all **load-from-memory** instructions that were issued **prior** the LFENCE instruction. Specifically, LFENCE does not execute until all prior instructions have completed locally, and **no later instruction** begins execution until LFENCE completes. 

`lfence` guarantees that earlier **loads** are not moved after it and that later **loads/stores** are not moved before it, which gives L-L and L-S.


* `sfence` = S-S

**SFENCE** the processor ensures that every **store prior** to SFENCE is globally visible before any **store after** SFENCE becomes globally visible

`sfence` guarantees that earlier **stores** are not moved after it and that later **stores** are not moved before it, which gives S-S.

* `mfence` = L-L + L-S + S-S + S-L

**MFENCE** Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes the MFENCE instruction in program order becomes globally visible before any load or store instruction that follows the MFENCE instruction.


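4. Because of the store buffer, memory accesses can appear reordered even when instructions execute in order. For example:

Store(a) - Load(b)

Store(a) sits in the store buffer for a while; it has not reached the cache and is invisible to other CPUs. Only after Load(b) completes is `a` flushed to the cache, so to other CPUs it looks as if Load(b) happened first and Store(a) second.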


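My understanding:

On x86, S-L is the most expensive ordering to enforce, and the CPU does not guarantee it by default. For L-L and L-S, memory order is preserved as long as instructions execute in program order. For S-S, besides executing in program order, the store buffer only has to drain to the cache in FIFO order; it does not have to drain completely.

Guaranteeing S-L, however, requires the store buffer to be fully flushed to the cache and visible to every CPU before the load can proceed. Flushing the store buffer is slow because it involves communication with other CPUs, so S-L is not guaranteed by default. When S-L ordering is needed, use `mfence`.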
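The Store(a) - Load(b) case is the classic store-buffer litmus test. The sketch below writes the two sides as ordinary functions that are meant to run concurrently on two CPUs; the names `store_a_load_b` and `store_b_load_a` are hypothetical, and the C11 sequentially consistent fence stands in for `mfence` (on x86 compilers typically emit `mfence` or a `lock`-prefixed instruction for it).

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int a, b;   /* both start at 0 */

/* Intended for one CPU: Store(a), then Load(b). */
int store_a_load_b(void)
{
    atomic_store_explicit(&a, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* S-L fence */
    return atomic_load_explicit(&b, memory_order_relaxed);
}

/* Intended for another CPU at the same time: Store(b), then Load(a). */
int store_b_load_a(void)
{
    atomic_store_explicit(&b, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* S-L fence */
    return atomic_load_explicit(&a, memory_order_relaxed);
}
```

With both fences in place, the two functions can never both return 0, however the threads interleave. Without the fences, each store may still be waiting in its CPU's store buffer when the other CPU loads, so both can return 0 even on x86.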


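5. On Intel x86 CPUs, `lfence`/`sfence` are redundant for ordering ordinary (write-back) memory accesses, because the architecture only allows S-L reordering. In that case `mfence` is the only fence that is really needed [(intel fence)](https://www.anycodings.com/1questions/1791050/does-the-intel-memory-model-make-sfence-and-lfence-redundant)

6. In the code above, if the CPU does not provide the strong TSO memory model, a full barrier (`smp_mb()`, which is `mfence` on 64-bit SMP builds) is placed before the write and another one after the read. On CPUs that do provide strong TSO, a compiler barrier is enough.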
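The TSO variants of the two macros can be imitated in user space. This is a simplified sketch under stated assumptions: `my_store_release`, `my_load_acquire`, `send_message`, and `try_receive` are hypothetical names, the type check from the kernel macros is left out, and on non-x86 targets the sketch falls back to the compiler's `__atomic_thread_fence` builtin because a compiler barrier would not be enough there.

```c
#include <assert.h>

#define barrier() __asm__ __volatile__("" ::: "memory")

#if defined(__x86_64__) || defined(__i386__)
/* x86 TSO: only S-L can be reordered, so a compiler barrier suffices. */
#define release_fence() barrier()
#define acquire_fence() barrier()
#else
/* Assumption for other CPUs: use real fences provided by the compiler. */
#define release_fence() __atomic_thread_fence(__ATOMIC_RELEASE)
#define acquire_fence() __atomic_thread_fence(__ATOMIC_ACQUIRE)
#endif

/* Release: barrier first, then the write. */
#define my_store_release(p, v)                              \
do {                                                        \
    release_fence();                                        \
    *(volatile __typeof__(*(p)) *)(p) = (v);                \
} while (0)

/* Acquire: the read first, then the barrier. */
#define my_load_acquire(p)                                  \
({                                                          \
    __typeof__(*(p)) v_ = *(const volatile __typeof__(*(p)) *)(p); \
    acquire_fence();                                        \
    v_;                                                     \
})

static int message;         /* data protected by the flag */
static int message_ready;   /* flag written with release, read with acquire */

void send_message(int value)
{
    message = value;                       /* ordinary write */
    my_store_release(&message_ready, 1);   /* cannot move before the write above */
}

int try_receive(int *out)
{
    if (!my_load_acquire(&message_ready))  /* reads below cannot move above this */
        return 0;
    *out = message;
    return 1;
}
```

The placement mirrors the kernel macros: everything the writer did before the release store is visible to a reader that observes the flag through the acquire load.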

------------------------------------


## tools/include/linux/compiler.h

```c
/*
 * Following functions are taken from kernel sources and
 * break aliasing rules in their original form.
 *
 * While kernel is compiled with -fno-strict-aliasing,
 * perf uses -Wstrict-aliasing=3 which makes build fail
 * under gcc 4.4.
 *
 * Using extra __may_alias__ type to allow aliasing
 * in this case.
 */
typedef __u8  __attribute__((__may_alias__))  __u8_alias_t;
typedef __u16 __attribute__((__may_alias__)) __u16_alias_t;
typedef __u32 __attribute__((__may_alias__)) __u32_alias_t;
typedef __u64 __attribute__((__may_alias__)) __u64_alias_t;

static __always_inline void __read_once_size(const volatile void *p, void *res, int size)
{
	switch (size) {
	case 1: *(__u8_alias_t  *) res = *(volatile __u8_alias_t  *) p; break;
	case 2: *(__u16_alias_t *) res = *(volatile __u16_alias_t *) p; break;
	case 4: *(__u32_alias_t *) res = *(volatile __u32_alias_t *) p; break;
	case 8: *(__u64_alias_t *) res = *(volatile __u64_alias_t *) p; break;
	default:
		barrier();
		__builtin_memcpy((void *)res, (const void *)p, size);
		barrier();
	}
}

static __always_inline void __write_once_size(volatile void *p, void *res, int size)
{
	switch (size) {
	case 1: *(volatile  __u8_alias_t *) p = *(__u8_alias_t  *) res; break;
	case 2: *(volatile __u16_alias_t *) p = *(__u16_alias_t *) res; break;
	case 4: *(volatile __u32_alias_t *) p = *(__u32_alias_t *) res; break;
	case 8: *(volatile __u64_alias_t *) p = *(__u64_alias_t *) res; break;
	default:
		barrier();
		__builtin_memcpy((void *)p, (const void *)res, size);
		barrier();
	}
}

/*
 * Prevent the compiler from merging or refetching reads or writes. The
 * compiler is also forbidden from reordering successive instances of
 * READ_ONCE, WRITE_ONCE and ACCESS_ONCE (see below), but only when the
 * compiler is aware of some particular ordering.  One way to make the
 * compiler aware of ordering is to put the two invocations of READ_ONCE,
 * WRITE_ONCE or ACCESS_ONCE() in different C statements.
 *
 * In contrast to ACCESS_ONCE these two macros will also work on aggregate
 * data types like structs or unions. If the size of the accessed data
 * type exceeds the word size of the machine (e.g., 32 bits or 64 bits)
 * READ_ONCE() and WRITE_ONCE()  will fall back to memcpy and print a
 * compile-time warning.
 *
 * Their two major use cases are: (1) Mediating communication between
 * process-level code and irq/NMI handlers, all running on the same CPU,
 * and (2) Ensuring that the compiler does not  fold, spindle, or otherwise
 * mutilate accesses that either do not require ordering or that interact
 * with an explicit memory barrier or atomic instruction that provides the
 * required ordering.
 */

#define READ_ONCE(x) \
	({ union { typeof(x) __val; char __c[1]; } __u; __read_once_size(&(x), __u.__c, sizeof(x)); __u.__val; })

#define WRITE_ONCE(x, val) \
	({ union { typeof(x) __val; char __c[1]; } __u = { .__val = (val) }; __write_once_size(&(x), __u.__c, sizeof(x)); __u.__val; })

#endif /* _TOOLS_LINUX_COMPILER_H */


#define __READ_ONCE_SIZE						\
({									\
	switch (size) {							\
	case 1: *(__u8 *)res = *(volatile __u8 *)p; break;		\
	case 2: *(__u16 *)res = *(volatile __u16 *)p; break;		\
	case 4: *(__u32 *)res = *(volatile __u32 *)p; break;		\
	case 8: *(__u64 *)res = *(volatile __u64 *)p; break;		\
	default:							\
		barrier();						\
		__builtin_memcpy((void *)res, (const void *)p, size);	\
		barrier();						\
	}								\
})

```

1. `READ_ONCE` and `WRITE_ONCE` constrain the compiler only: they stop it from merging, refetching, or optimizing away the marked access. They emit no CPU barrier.

2. union trick

[stack overflow](https://stackoverflow.com/questions/54177247/why-this-union-has-char-array-at-the-end)

[code](https://github.com/torvalds/linux/commit/dd36929720f40f17685e841ae0d4c581c165ea60)

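When the `x` being read is a `const` variable, `typeof(x)` is `const` as well, so a plain temporary of that type could not be assigned to. The `union` works around this: `__c` in `char __c[1]` shares the address of `__val` but is not `const`, so the value can be written through it.

3. For sizes of 1, 2, 4, and 8 bytes, the access goes through a `volatile` pointer so that the compiler cannot optimize it. For any other size, the macros fall back to a memory copy with `barrier()` before and after it.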
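For scalar types, the core of both macros is a single access through a `volatile`-qualified pointer. The sketch below shows that simplified form; `MY_READ_ONCE`, `MY_WRITE_ONCE`, `stop_requested`, and the two functions are hypothetical names, and the union and `memcpy` fallback from the listing above are deliberately left out.

```c
#include <assert.h>

/* Simplified scalar-only forms: one volatile access, no union, no memcpy. */
#define MY_READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define MY_WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

static int stop_requested;   /* hypothetical flag set by another thread or IRQ */

/* Spins until the flag is set or max_spins is reached; returns the spin count. */
int spin_until_stopped(int max_spins)
{
    int spins = 0;
    /* Without MY_READ_ONCE the compiler could hoist the load out of the loop. */
    while (!MY_READ_ONCE(stop_requested) && spins < max_spins)
        spins++;
    return spins;
}

void request_stop(void)
{
    MY_WRITE_ONCE(stop_requested, 1);   /* a single store that is not elided */
}
```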

--------------------

```c
/* Is this type a native word size -- useful for atomic operations */
#ifndef __native_word
# define __native_word(t) (sizeof(t) == sizeof(char) || sizeof(t) == sizeof(short) || sizeof(t) == sizeof(int) || sizeof(t) == sizeof(long))
#endif

#define compiletime_assert_atomic_type(t)				\
	compiletime_assert(__native_word(t),				\
		"Need native word sized stores/loads for atomicity.")
```

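1. A compile-time check that the type can be accessed atomically: any type whose size matches that of `char`, `short`, `int`, or `long` qualifies.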
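The same check can be reproduced in plain C11 with `_Static_assert`. This is a sketch only: `my_native_word` copies the size test from the macro above, and `struct pair` is a hypothetical type chosen to be wider than any native word.

```c
#include <assert.h>

#define my_native_word(t)                                       \
    (sizeof(t) == sizeof(char) || sizeof(t) == sizeof(short) || \
     sizeof(t) == sizeof(int)  || sizeof(t) == sizeof(long))

struct pair { long a; long b; };   /* hypothetical: two words, not one */

/* These are evaluated by the compiler; a false condition stops the build. */
_Static_assert(my_native_word(int), "int should be a native word");
_Static_assert(my_native_word(long), "long should be a native word");
_Static_assert(!my_native_word(struct pair), "struct pair is too wide");
```

Because the condition is built only from `sizeof`, it is a constant expression and costs nothing at run time.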

-----------------------

## arch/x86/include/asm/cmpxchg.h

```c
/* 
 * An exchange-type operation, which takes a value and a pointer, and
 * returns the old value.
 */
#define __xchg_op(ptr, arg, op, lock)					\
	({								\
	        __typeof__ (*(ptr)) __ret = (arg);			\
		switch (sizeof(*(ptr))) {				\
		case __X86_CASE_B:					\
			asm volatile (lock #op "b %b0, %1\n"		\
				      : "+q" (__ret), "+m" (*(ptr))	\
				      : : "memory", "cc");		\
			break;						\
		case __X86_CASE_W:					\
			asm volatile (lock #op "w %w0, %1\n"		\
				      : "+r" (__ret), "+m" (*(ptr))	\
				      : : "memory", "cc");		\
			break;						\
		case __X86_CASE_L:					\
			asm volatile (lock #op "l %0, %1\n"		\
				      : "+r" (__ret), "+m" (*(ptr))	\
				      : : "memory", "cc");		\
			break;						\
		case __X86_CASE_Q:					\
			asm volatile (lock #op "q %q0, %1\n"		\
				      : "+r" (__ret), "+m" (*(ptr))	\
				      : : "memory", "cc");		\
			break;						\
		default:						\
			__ ## op ## _wrong_size();			\
		}							\
		__ret;							\
	})

/*
 * Note: no "lock" prefix even on SMP: xchg always implies lock anyway.
 * Since this is generally used to protect other memory information, we
 * use "asm volatile" and "memory" clobbers to prevent gcc from moving
 * information around.
 */
#define xchg(ptr, v)	__xchg_op((ptr), (v), xchg, "")

```

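1. `xchg` with a memory operand does not need the `lock` prefix, because the CPU always executes it as a locked operation. Other operations, such as `xadd`, need an explicit `lock` prefix to be atomic.

2. Note that `lock` in `__xchg_op` is a macro parameter: a string literal (`""`, `"lock; "`, or `LOCK_PREFIX`) that is placed in front of the instruction mnemonic.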
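An atomic exchange is enough to build a simple test-and-set lock. The sketch below uses the GCC/Clang builtin `__atomic_exchange_n` rather than the kernel's `xchg()` macro (on x86 the builtin typically compiles to an `xchg` instruction); `xchg_lock_t` and the three functions are hypothetical names.

```c
#include <assert.h>

typedef struct { int locked; } xchg_lock_t;   /* 0 = free, 1 = held */

/* Atomically write 1 and look at the old value: 0 means we took the lock. */
int xchg_trylock(xchg_lock_t *l)
{
    return __atomic_exchange_n(&l->locked, 1, __ATOMIC_ACQUIRE) == 0;
}

void xchg_lock(xchg_lock_t *l)
{
    while (!xchg_trylock(l))
        ;   /* spin until the previous owner releases the lock */
}

void xchg_unlock(xchg_lock_t *l)
{
    __atomic_store_n(&l->locked, 0, __ATOMIC_RELEASE);
}
```

The exchange returns the old value and installs the new one in one indivisible step, which is exactly the "takes a value and a pointer, and returns the old value" contract described in the comment on `__xchg_op`.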

----------------------


```c
/*
 * Constants for operation sizes. On 32-bit, the 64-bit size it set to
 * -1 because sizeof will never return -1, thereby making those switch
 * case statements guaranteeed dead code which the compiler will
 * eliminate, and allowing the "missing symbol in the default case" to
 * indicate a usage error.
 */
#define __X86_CASE_B	1
#define __X86_CASE_W	2
#define __X86_CASE_L	4
#ifdef CONFIG_64BIT
#define __X86_CASE_Q	8
#else
#define	__X86_CASE_Q	-1		/* sizeof will never return -1 */
#endif

/*
 * xadd() adds "inc" to "*ptr" and atomically returns the previous
 * value of "*ptr".
 *
 * xadd() is locked when multiple CPUs are online
 * xadd_sync() is always locked
 * xadd_local() is never locked
 */
#define __xadd(ptr, inc, lock)	__xchg_op((ptr), (inc), xadd, lock)
#define xadd(ptr, inc)		__xadd((ptr), (inc), LOCK_PREFIX)
#define xadd_sync(ptr, inc)	__xadd((ptr), (inc), "lock; ")
#define xadd_local(ptr, inc)	__xadd((ptr), (inc), "")

#define __add(ptr, inc, lock)						\
	({								\
	        __typeof__ (*(ptr)) __ret = (inc);			\
		switch (sizeof(*(ptr))) {				\
		case __X86_CASE_B:					\
			asm volatile (lock "addb %b1, %0\n"		\
				      : "+m" (*(ptr)) : "qi" (inc)	\
				      : "memory", "cc");		\
			break;						\
		case __X86_CASE_W:					\
			asm volatile (lock "addw %w1, %0\n"		\
				      : "+m" (*(ptr)) : "ri" (inc)	\
				      : "memory", "cc");		\
			break;						\
		case __X86_CASE_L:					\
			asm volatile (lock "addl %1, %0\n"		\
				      : "+m" (*(ptr)) : "ri" (inc)	\
				      : "memory", "cc");		\
			break;						\
		case __X86_CASE_Q:					\
			asm volatile (lock "addq %1, %0\n"		\
				      : "+m" (*(ptr)) : "ri" (inc)	\
				      : "memory", "cc");		\
			break;						\
		default:						\
			__add_wrong_size();				\
		}							\
		__ret;							\
	})

/*
 * add_*() adds "inc" to "*ptr"
 *
 * __add() takes a lock prefix
 * add_smp() is locked when multiple CPUs are online
 * add_sync() is always locked
 */
#define add_smp(ptr, inc)	__add((ptr), (inc), LOCK_PREFIX)
#define add_sync(ptr, inc)	__add((ptr), (inc), "lock; ")

```

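1. `add` needs the `lock` prefix to be atomic. There is no magic here: the prefix gives the CPU exclusive ownership of the target cache line for the duration of the instruction, falling back to locking the memory bus only when that is not possible.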
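The difference between `xadd()` (returns the previous value) and `add_smp()` (result not needed) can be sketched with the GCC/Clang builtin `__atomic_fetch_add`. The names `my_xadd` and `my_add` are hypothetical; on x86 the compiler typically emits `lock xadd` for the first and a `lock add` for the second.

```c
#include <assert.h>

/* Like xadd(): add inc to *ptr atomically and return the previous value. */
int my_xadd(int *ptr, int inc)
{
    return __atomic_fetch_add(ptr, inc, __ATOMIC_SEQ_CST);
}

/* Like add_smp(): add inc to *ptr atomically, old value not needed. */
void my_add(int *ptr, int inc)
{
    (void)__atomic_fetch_add(ptr, inc, __ATOMIC_SEQ_CST);
}
```

A typical use of the returned value is handing out unique tickets or sequence numbers: every caller gets a different previous value even when many CPUs call `my_xadd()` at once.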

----------------------------

## include/linux/atomic.h

```c
#ifndef atomic_inc_unless_negative
static inline int atomic_inc_unless_negative(atomic_t *p)
{
	int v, v1;
	for (v = 0; v >= 0; v = v1) {
		v1 = atomic_cmpxchg(p, v, v + 1);
		if (likely(v1 == v))
			return 1;
	}
	return 0;
}
#endif

#ifndef atomic_dec_unless_positive
static inline int atomic_dec_unless_positive(atomic_t *p)
{
	int v, v1;
	for (v = 0; v <= 0; v = v1) {
		v1 = atomic_cmpxchg(p, v, v - 1);
		if (likely(v1 == v))
			return 1;
	}
	return 0;
}
#endif
```

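1. The classic compare-and-swap (CAS) retry loop: guess the current value, attempt `atomic_cmpxchg()`, and retry with the value actually observed until the exchange succeeds or the condition no longer holds.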
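The same retry loop can be written in user space with C11 atomics. This is a sketch, not the kernel function: `inc_unless_negative` is a hypothetical name, and it starts from a load of the current value instead of the guess `v = 0` used above.

```c
#include <assert.h>
#include <stdatomic.h>

/* Increment *p unless it is negative. Returns 1 on success, 0 otherwise. */
int inc_unless_negative(atomic_int *p)
{
    int old = atomic_load(p);
    while (old >= 0) {
        /* On failure, `old` is refreshed with the value currently in *p. */
        if (atomic_compare_exchange_weak(p, &old, old + 1))
            return 1;
    }
    return 0;
}
```

The loop only retries when another thread changed the value between the read and the exchange, so under low contention it usually finishes in one pass.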

--------------------

```c
/*
 * The idea here is to build acquire/release variants by adding explicit
 * barriers on top of the relaxed variant. In the case where the relaxed
 * variant is already fully ordered, no additional barriers are needed.
 */
#define __atomic_op_acquire(op, args...)				\
({									\
	typeof(op##_relaxed(args)) __ret  = op##_relaxed(args);		\
	smp_mb__after_atomic();						\
	__ret;								\
})

#define __atomic_op_release(op, args...)				\
({									\
	smp_mb__before_atomic();					\
	op##_relaxed(args);						\
})

#define __atomic_op_fence(op, args...)					\
({									\
	typeof(op##_relaxed(args)) __ret;				\
	smp_mb__before_atomic();					\
	__ret = op##_relaxed(args);					\
	smp_mb__after_atomic();						\
	__ret;								\
})


#ifndef atomic_add_return_acquire
#define  atomic_add_return_acquire(...)					\
	__atomic_op_acquire(atomic_add_return, __VA_ARGS__)
#endif

#ifndef atomic_add_return_release
#define  atomic_add_return_release(...)					\
	__atomic_op_release(atomic_add_return, __VA_ARGS__)
#endif

#ifndef atomic_add_return
#define  atomic_add_return(...)						\
	__atomic_op_fence(atomic_add_return, __VA_ARGS__)
#endif

```

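1. This defines the memory-ordering variants of an atomic operation on top of its `_relaxed` form. `acquire` adds `smp_mb__after_atomic()` after the operation, `release` adds `smp_mb__before_atomic()` before it, and the fully ordered version (`__atomic_op_fence`) adds both. The `_relaxed` form has no memory barrier on either side and is just the bare atomic operation.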
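The construction can be imitated in user space with C11 atomics and GNU statement expressions. This is a sketch under stated assumptions: all `my_` names are hypothetical, and C11 fences stand in for `smp_mb__before_atomic()` and `smp_mb__after_atomic()` (the kernel barriers are full barriers, while the sketch uses the weakest C11 fence matching each variant's name).

```c
#include <assert.h>
#include <stdatomic.h>

/* The relaxed building block: atomic, but with no ordering guarantees. */
static inline int my_add_return_relaxed(int i, atomic_int *v)
{
    return atomic_fetch_add_explicit(v, i, memory_order_relaxed) + i;
}

#define my_op_acquire(op, ...)                                          \
({                                                                      \
    __typeof__(op##_relaxed(__VA_ARGS__)) ret_ = op##_relaxed(__VA_ARGS__); \
    atomic_thread_fence(memory_order_acquire);   /* barrier after */    \
    ret_;                                                               \
})

#define my_op_release(op, ...)                                          \
({                                                                      \
    atomic_thread_fence(memory_order_release);   /* barrier before */   \
    op##_relaxed(__VA_ARGS__);                                          \
})

#define my_op_fence(op, ...)                                            \
({                                                                      \
    __typeof__(op##_relaxed(__VA_ARGS__)) ret_;                         \
    atomic_thread_fence(memory_order_seq_cst);   /* barrier before */   \
    ret_ = op##_relaxed(__VA_ARGS__);                                   \
    atomic_thread_fence(memory_order_seq_cst);   /* barrier after */    \
    ret_;                                                               \
})

#define my_add_return_acquire(...) my_op_acquire(my_add_return, __VA_ARGS__)
#define my_add_return_release(...) my_op_release(my_add_return, __VA_ARGS__)
#define my_add_return(...)         my_op_fence(my_add_return, __VA_ARGS__)
```

All three variants return the same value; they differ only in which surrounding memory accesses may be reordered across the operation.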

-------------

## arch/x86/include/asm/bitops.h
Atomic and non-atomic bit operations.

```c
/*
 * We do the locked ops that don't return the old value as
 * a mask operation on a byte.
 */
#define IS_IMMEDIATE(nr)		(__builtin_constant_p(nr))
#define CONST_MASK_ADDR(nr, addr)	BITOP_ADDR((void *)(addr) + ((nr)>>3))
#define CONST_MASK(nr)			(1 << ((nr) & 7))
```

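1. `__builtin_constant_p` is a GCC builtin that reports whether its argument is a compile-time constant.

2. `CONST_MASK_ADDR` computes the address of the byte that holds bit `nr`: a byte has 8 bits, so the byte offset is `nr >> 3` (GCC treats arithmetic on `void *` as byte-granular). `CONST_MASK` then selects the bit inside that byte with `nr & 7`.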
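The byte and mask arithmetic can be checked with a small non-atomic sketch. `MY_CONST_MASK` copies the mask computation above; `set_bit_in_bytes` is a hypothetical helper that works on a byte array and has none of the atomicity of the real `set_bit()`.

```c
#include <assert.h>

#define MY_CONST_MASK(nr) (1 << ((nr) & 7))   /* bit position inside the byte */

/* Non-atomic illustration: set bit nr, counting from the start of base. */
void set_bit_in_bytes(unsigned long nr, unsigned char *base)
{
    unsigned char *byte = base + (nr >> 3);   /* nr / 8 selects the byte */
    *byte |= (unsigned char)MY_CONST_MASK(nr);
}
```

For example, bit 10 lives in byte 1 (`10 >> 3`) at position 2 (`10 & 7`), so setting it turns byte 1 into `0x04`.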

-------------------

```c
/**
 * set_bit - Atomically set a bit in memory
 * @nr: the bit to set
 * @addr: the address to start counting from
 *
 * This function is atomic and may not be reordered.  See __set_bit()
 * if you do not require the atomic guarantees.
 *
 * Note: there are no guarantees that this function will not be reordered
 * on non x86 architectures, so if you are writing portable code,
 * make sure not to rely on its reordering guarantees.
 *
 * Note that @nr may be almost arbitrarily large; this function is not
 * restricted to acting on a single-word quantity.
 */
static __always_inline void
set_bit(long nr, volatile unsigned long *addr)
{
	if (IS_IMMEDIATE(nr)) {
		asm volatile(LOCK_PREFIX "orb %1,%0"
			: CONST_MASK_ADDR(nr, addr)
			: "iq" ((u8)CONST_MASK(nr))
			: "memory");
	} else {
		asm volatile(LOCK_PREFIX "bts %1,%0"
			: BITOP_ADDR(addr) : "Ir" (nr) : "memory");
	}
}
```

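1. Note that `nr` can be almost arbitrarily large: the bit index may reach past the first word at `addr`.

2. The `lock` prefix makes the read-modify-write atomic. A `lock`-prefixed instruction also acts as a full memory barrier, with an ordering effect comparable to `mfence`.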
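A portable counterpart of `set_bit()` can be sketched with the GCC/Clang builtin `__atomic_fetch_or`. The names `my_set_bit`, `my_test_bit`, and `MY_BITS_PER_LONG` are hypothetical; unlike the listing above, the sketch works on whole `unsigned long` words rather than single bytes, and on x86 the compiler typically emits a `lock or` for it.

```c
#include <assert.h>
#include <limits.h>

#define MY_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Atomically set bit nr in the bitmap starting at addr. */
void my_set_bit(unsigned long nr, unsigned long *addr)
{
    unsigned long mask = 1UL << (nr % MY_BITS_PER_LONG);
    /* nr may exceed one word, so pick the word first. */
    (void)__atomic_fetch_or(&addr[nr / MY_BITS_PER_LONG], mask, __ATOMIC_SEQ_CST);
}

/* Return 1 if bit nr is set, 0 otherwise. */
int my_test_bit(unsigned long nr, const unsigned long *addr)
{
    unsigned long word = __atomic_load_n(&addr[nr / MY_BITS_PER_LONG],
                                         __ATOMIC_RELAXED);
    return (int)((word >> (nr % MY_BITS_PER_LONG)) & 1UL);
}
```

The sequentially consistent ordering requested here matches the note above that a `lock`-prefixed instruction also orders surrounding memory accesses.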

------------------