## Complementary Notebook: Appropriate Operators to Approximate Connectives and Quantifiers

This notebook is a complement to the tutorial on operators (2-grounding_connectives.ipynb).

Logical connectives are grounded in LTN using fuzzy semantics. However, while all fuzzy logic operators make sense when simply *querying* the language, not every operator is equally suited for *learning*.

We will see common problems of some fuzzy semantics and which operators are better for the task of *learning*.

In [1]:
import ltn
import torch

### Querying

One can access the implementation of the most common fuzzy semantics in the `ltn.fuzzy_ops` module.
They are implemented using PyTorch primitives.

Here, we compare:
- the product t-norm: $u \land_{\mathrm{prod}} v = uv$,
- the Lukasiewicz t-norm: $u \land_{\mathrm{luk}} v = \max(u+v-1,0)$,
- the minimum aggregator: $\min(u_1,\dots,u_n)$,
- the p-mean error aggregator (generalized mean of the deviations w.r.t. the truth): $\mathrm{pME}(u_1,\dots,u_n) = 1 - \biggl( \frac{1}{n} \sum\limits_{i=1}^n (1-u_i)^p \biggr)^{\frac{1}{p}}$.

These operators convey very different meanings, but each of them can make sense depending on the intent of the query.

The following cells show that different semantics for the conjunction return very different results on the same inputs.
The same holds when comparing different aggregators computed on the same input.
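
To make the four definitions concrete, here is a minimal, dependency-free sketch that restates them in plain Python and evaluates them on the same inputs as the cells below. The function names are illustrative and are not part of `ltn.fuzzy_ops`; the library versions operate on tensors.

```python
# Plain-Python restatement of the four operators listed above.
# All names are hypothetical helpers for illustration only.

def and_prod(u, v):
    """Product t-norm: u * v."""
    return u * v

def and_luk(u, v):
    """Lukasiewicz t-norm: max(u + v - 1, 0)."""
    return max(u + v - 1.0, 0.0)

def aggreg_min(us):
    """Minimum aggregator."""
    return min(us)

def aggreg_pmean_error(us, p):
    """p-mean error: 1 - (mean((1 - u_i)^p))^(1/p)."""
    n = len(us)
    return 1.0 - (sum((1.0 - u) ** p for u in us) / n) ** (1.0 / p)

xs = [1.0, 1.0, 1.0, 0.5, 0.3, 0.2, 0.2, 0.1]

print(round(and_prod(0.4, 0.7), 4))            # 0.28
print(round(and_luk(0.4, 0.7), 4))             # 0.1
print(aggreg_min(xs))                          # 0.1
print(round(aggreg_pmean_error(xs, p=4), 4))   # 0.3134
```

The values agree with the tensors printed by the library operators in the next two cells.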

In [2]:
x1 = torch.tensor(0.4)
x2 = torch.tensor(0.7)

# the stable keyword is explained at the end of the notebook
and_prod = ltn.fuzzy_ops.AndProd(stable=False)
and_luk = ltn.fuzzy_ops.AndLuk()

print(and_prod(x1, x2))
print(and_luk(x1, x2))

tensor(0.2800)
tensor(0.1000)


In [3]:
xs = torch.tensor([1., 1., 1., 0.5, 0.3, 0.2, 0.2, 0.1])

# the stable keyword is explained at the end of the notebook
forall_min = ltn.fuzzy_ops.AggregMin()  # minimum aggregator
forall_pME = ltn.fuzzy_ops.AggregPMeanError(p=4, stable=False)  # p-mean error aggregator

print(forall_min(xs, dim=0))  # xs is a 1-D tensor, so dim=0 aggregates over all of its elements
print(forall_pME(xs, dim=0))

print(xs.shape)
print(xs)

tensor(0.1000)
tensor(0.3134)
torch.Size([8])
tensor([1.0000, 1.0000, 1.0000, 0.5000, 0.3000, 0.2000, 0.2000, 0.1000])


### Learning

While all operators are suitable in a querying setting, this is not the case in a learning setting. Indeed, many fuzzy logic operators have derivatives that are not suitable for gradient-based algorithms. For more details, read [van Krieken et al., *Analyzing Differentiable Fuzzy Logic Operators*, 2020](https://arxiv.org/abs/2002.06100).

Here, we give simple illustrations of such gradient issues.

#### 1. Vanishing Gradients

Some operators have vanishing gradients on some part of their domains.

For example, in $u \land_{\mathrm{luk}} v = \max(u+v-1,0)$, if $u+v-1 < 0$, the gradients vanish.

In the following, it is possible to observe an edge case in which the Lukasiewicz conjunction leads to vanishing gradients.
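
The reason is visible in the partial derivatives. Away from the kink at $u+v=1$, they are:

$$
\frac{\partial}{\partial u}\max(u+v-1,0) = \frac{\partial}{\partial v}\max(u+v-1,0) =
\begin{cases}
1 & \text{if } u+v > 1,\\
0 & \text{if } u+v < 1.
\end{cases}
$$

With $u=0.3$ and $v=0.5$, as in the cell below, $u+v-1=-0.2<0$, so the conjunction evaluates to $0$ and neither input receives any gradient.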

In [4]:
x1 = torch.tensor(0.3, requires_grad=True)  # requires_grad=True lets autograd track gradients for this input
x2 = torch.tensor(0.5, requires_grad=True)

y = and_luk(x1, x2)
y.backward()  # backpropagate to populate x1.grad and x2.grad
res = y.item()  # extract the scalar value of y
gradients = [v.grad for v in [x1, x2]]
# print the result of the conjunction
print(res)
# print gradients of x1 and x2
print(gradients)

0.0
[tensor(0.), tensor(0.)]


#### 2. Single-Passing Gradients

Some operators have gradients propagating to only one input at a time, meaning that all other inputs will not benefit from learning at this step.

An example is the minimum aggregator, namely $\min(u_1,\dots,u_n)$.

In the following, it is possible to observe an edge case in which the `Min` aggregator leads to single-passing gradients.

In [5]:
xs = torch.tensor([1., 1., 1., 0.5, 0.3, 0.2, 0.2, 0.1], requires_grad=True)

y = forall_min(xs, dim=0)
res = y.item()
y.backward()
gradients = xs.grad
# print the result of the aggregation
print(res)
# print gradients of xs
print(gradients)

0.10000000149011612
tensor([0., 0., 0., 0., 0., 0., 0., 1.])


#### 3. Exploding Gradients

Some operators have exploding gradients on some part of their domains.

An example is the `pMeanError` aggregator, namely $\mathrm{pME}(u_1,\dots,u_n) = 1 - \biggl( \frac{1}{n} \sum\limits_{i=1}^n (1-u_i)^p \biggr)^{\frac{1}{p}}$.

In the edge case where all inputs are $1.0$, this operator leads to exploding gradients.

In the following, it is possible to observe this behavior.
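
A short derivation shows where the `NaN` comes from. Writing $S = \frac{1}{n}\sum_{i=1}^n (1-u_i)^p$, so that $\mathrm{pME} = 1 - S^{\frac{1}{p}}$, the chain rule gives:

$$
\frac{\partial\,\mathrm{pME}}{\partial u_j} = \frac{1}{n}\,(1-u_j)^{p-1}\,S^{\frac{1}{p}-1}.
$$

For $p > 1$ and $u_1=\dots=u_n=1$, the factor $(1-u_j)^{p-1}$ is $0$, while $S^{\frac{1}{p}-1}$ diverges because $S=0$ is raised to a negative power. Automatic differentiation multiplies these intermediate factors, and in floating-point arithmetic $0 \cdot \infty$ evaluates to `NaN`, which is consistent with the gradients printed below.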

In [6]:
xs = torch.tensor([1., 1., 1.], requires_grad=True)

y = forall_pME(xs, dim=0, p=4)
res = y.item()
y.backward()
gradients = xs.grad
# print the result of the aggregation
print(res)
# print the gradients of xs
print(gradients)

1.0
tensor([nan, nan, nan])


### Stable Product Configuration

#### Product Configuration

In general, we recommend using the following "product configuration" in LTN:
* not: the standard negation  $\lnot u = 1-u$,
* and: the product t-norm $u \land v = uv$,
* or: the product t-conorm (probabilistic sum) $u \lor v = u+v-uv$,
* implication: the Reichenbach implication $u \rightarrow v = 1 - u + uv$,
* existential quantification ("exists"): the generalized mean (p-mean) $\mathrm{pM}(u_1,\dots,u_n) = \biggl( \frac{1}{n} \sum\limits_{i=1}^n u_i^p \biggr)^{\frac{1}{p}} \qquad p \geq 1$,
* universal quantification ("for all"): the generalized mean of "the deviations w.r.t. the truth" (p-mean error) $\mathrm{pME}(u_1,\dots,u_n) = 1 - \biggl( \frac{1}{n} \sum\limits_{i=1}^n (1-u_i)^p \biggr)^{\frac{1}{p}} \qquad p \geq 1$.
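
As a rough illustration of how these six choices fit together, the sketch below restates them in plain Python. The function names and the default `p=2` are assumptions made for this example, not the `ltn.fuzzy_ops` API. It also checks two identities that follow directly from the formulas above: the Reichenbach implication equals $\lnot u \lor v$ under the product t-conorm, and `pMeanError` is the dual of `pMean`.

```python
# Plain-Python sketch of the "product configuration".
# All names and the default p=2 are hypothetical choices for illustration.

def not_std(u):
    """Standard negation: 1 - u."""
    return 1.0 - u

def and_prod(u, v):
    """Product t-norm: u * v."""
    return u * v

def or_prob_sum(u, v):
    """Product t-conorm (probabilistic sum): u + v - u * v."""
    return u + v - u * v

def implies_reichenbach(u, v):
    """Reichenbach implication: 1 - u + u * v."""
    return 1.0 - u + u * v

def exists_pmean(us, p=2):
    """Generalized mean: (mean(u_i^p))^(1/p)."""
    return (sum(u ** p for u in us) / len(us)) ** (1.0 / p)

def forall_pmean_error(us, p=2):
    """p-mean error: 1 - (mean((1 - u_i)^p))^(1/p)."""
    return 1.0 - (sum((1.0 - u) ** p for u in us) / len(us)) ** (1.0 / p)

u, v = 0.4, 0.7
print(round(or_prob_sum(u, v), 4))          # 0.82
print(round(implies_reichenbach(u, v), 4))  # 0.88

# The Reichenbach implication coincides with (not u) or v under these choices.
assert abs(implies_reichenbach(u, v) - or_prob_sum(not_std(u), v)) < 1e-12

# pME is the dual of pM: forall(us) = 1 - exists(1 - us) for the same p.
us = [1.0, 0.5, 0.3, 0.1]
dual = 1.0 - exists_pmean([1.0 - x for x in us], p=3)
assert abs(forall_pmean_error(us, p=3) - dual) < 1e-12
```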

#### "Stable"

As is, this "product configuration" is not fully exempt from issues:
- the product t-norm has vanishing gradients on the edge case $u=v=0$;
- the product t-conorm has vanishing gradients on the edge case $u=v=1$;
- the Reichenbach implication has vanishing gradients on the edge case $u=0$,$v=1$;
- `pMean` has exploding gradients on the edge case $u_1=\dots=u_n=0$;
- `pMeanError` has exploding gradients on the edge case $u_1=\dots=u_n=1$.

However, all these issues happen on edge cases and can easily be fixed using the following "trick":
- if the edge case happens when an input $u$ is $0$, we modify every input with $u' = (1-\epsilon)u+\epsilon$;
- if the edge case happens when an input $u$ is $1$, we modify every input with $u' = (1-\epsilon)u$;

where $\epsilon$ is a small positive value (e.g. $1\mathrm{e}{-5}$).

This "trick" gives us a stable version of such operators, stable in the sense that they no longer have gradient issues.
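
The two shifts can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's implementation: the helper names are hypothetical, the gradients are computed analytically rather than by autograd, and $\epsilon = 10^{-4}$ is an assumption chosen because it is consistent with the value printed by the stable cell at the end of this section.

```python
EPS = 1e-4  # assumed value of epsilon for this sketch

def pi_0(u, eps=EPS):
    """Shift used when the edge case is at 0: maps [0, 1] to [eps, 1]."""
    return (1.0 - eps) * u + eps

def pi_1(u, eps=EPS):
    """Shift used when the edge case is at 1: maps [0, 1] to [0, 1 - eps]."""
    return (1.0 - eps) * u

def stable_and_prod_grad_u(u, v, eps=EPS):
    """d/du of pi_0(u) * pi_0(v); it is non-zero even at u = v = 0."""
    return (1.0 - eps) * pi_0(v, eps)

def stable_pmean_error(us, p, eps=EPS):
    """pME evaluated on the shifted inputs pi_1(u_i)."""
    shifted = [pi_1(u, eps) for u in us]
    return 1.0 - (sum((1.0 - s) ** p for s in shifted) / len(us)) ** (1.0 / p)

def stable_pmean_error_grad(us, p, eps=EPS):
    """Analytic gradient of stable_pmean_error with respect to each u_j."""
    n = len(us)
    shifted = [pi_1(u, eps) for u in us]
    mean_dev = sum((1.0 - s) ** p for s in shifted) / n
    # The shift keeps mean_dev > 0, so the negative power below stays finite.
    return [(1.0 - eps) * (1.0 - s) ** (p - 1) * mean_dev ** (1.0 / p - 1.0) / n
            for s in shifted]

ones = [1.0, 1.0, 1.0]
print(round(stable_pmean_error(ones, p=4), 4))                      # 0.9999
print([round(g, 4) for g in stable_pmean_error_grad(ones, p=4)])    # [0.3333, 0.3333, 0.3333]
print(stable_and_prod_grad_u(0.0, 0.0) > 0.0)                       # True
```

The price of the shift is a small bias in the truth value (here $0.9999$ instead of $1$), which is usually negligible compared with having usable gradients.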

One can trigger the stable version of such operators by using the boolean parameter `stable`. It is possible to set a default
value for `stable` when initializing the operator, or to use different values at each call of the operator.

In the following, we repeat the last example, this time using the stable version of the `pMeanError`
operator. The gradients are no longer `NaN`: thanks to the stable version of the
operator, we now obtain usable gradients.

In [7]:
xs = torch.tensor([1., 1., 1.], requires_grad=True)

# stable=True avoids the NaN gradients seen in the previous example
y = forall_pME(xs, dim=0, p=4, stable=True)
res = y.item()
y.backward()
gradients = xs.grad
# print the result of the aggregation
print(res)
# print the gradients of xs
print(gradients)

0.9998999834060669
tensor([0.3333, 0.3333, 0.3333])


#### The hyper-parameter $p$ in the generalized means

The hyper-parameter $p$ of `pMean` and `pMeanError` offers flexibility in writing more or less strict formulas, to
account for outliers in the data depending on the application. However, $p$ should be carefully set since it could have
strong implications for the training of LTN.

In the following, we see how a large increase of $p$ leads to single-passing gradients in the `pMeanError` operator. This is
intuitive: in the second tutorial we observed that `pMean` tends to `Max` when $p$ tends to infinity and, by the same argument,
`pMeanError` tends to `Min`. As seen earlier in this notebook for the `Min` aggregator, both `Max` and `Min` lead to single-passing gradients.

In [8]:
xs = torch.tensor([1., 1., 1., 0.5, 0.3, 0.2, 0.2, 0.1], requires_grad=True)

y = forall_pME(xs, dim=0, p=4)
res = y.item()
y.backward()
gradients = xs.grad
# print result of aggregation
print(res)
# print gradients of xs
print(gradients)

0.31339913606643677
tensor([0.0000, 0.0000, 0.0000, 0.0483, 0.1325, 0.1977, 0.1977, 0.2815])


In [9]:
xs = torch.tensor([1., 1., 1., 0.5, 0.3, 0.2, 0.2, 0.1], requires_grad=True)

y = forall_pME(xs, dim=0, p=20)
res = y.item()
y.backward()
gradients = xs.grad
# print result of aggregation
print(res)
# print gradients of xs
print(gradients)

0.18157517910003662
tensor([0.0000e+00, 0.0000e+00, 0.0000e+00, 1.0734e-05, 6.4147e-03, 8.1100e-02,
        8.1100e-02, 7.6019e-01])


While it can be tempting to set a high value for $p$ when querying, in a learning setting this can quickly lead to a "single-passing" operator that focuses too much on outliers at each step (i.e., the gradient concentrates on one input, potentially harming the training of the others). We recommend against setting $p$ too high when learning.
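
One way to get a feel for a given $p$ before training is to look at how much of the total gradient lands on the smallest input. The sketch below does this in plain Python with the analytic gradient of the p-mean error on the same `xs` as above; the helper names and the chosen values of $p$ are assumptions for illustration.

```python
def pmean_error(us, p):
    """p-mean error: 1 - (mean((1 - u_i)^p))^(1/p)."""
    n = len(us)
    return 1.0 - (sum((1.0 - u) ** p for u in us) / n) ** (1.0 / p)

def pmean_error_grad(us, p):
    """Analytic gradient: (1/n) * (1 - u_j)^(p-1) * S^(1/p - 1), with S the mean deviation^p."""
    n = len(us)
    s = sum((1.0 - u) ** p for u in us) / n
    return [(1.0 - u) ** (p - 1) * s ** (1.0 / p - 1.0) / n for u in us]

xs = [1.0, 1.0, 1.0, 0.5, 0.3, 0.2, 0.2, 0.1]

for p in (2, 4, 20, 100):
    grads = pmean_error_grad(xs, p)
    # The largest gradient belongs to the smallest input (largest deviation from 1).
    share = max(grads) / sum(grads)
    print(f"p={p:>3}  pME={pmean_error(xs, p):.4f}  gradient share of smallest input={share:.3f}")
```

As $p$ grows, the aggregated value moves toward $\min(xs) = 0.1$ and the share approaches $1$, which is the single-passing behavior described above.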

