What is the meaning of this snippet of code? Where does it likely stem from and why is it needed?

```kotlin
val attnMaskMul = sciCore.triangle(DataType.FLOAT32, seqLen, seqLen, 0.0, 1.0)
val attnMaskAdd = sciCore.triangle(DataType.FLOAT32, seqLen, seqLen, Double.NEGATIVE_INFINITY, 0.0)

att * attnMaskMul + attnMaskAdd
```
This code snippet is likely from an implementation of the attention mechanism in a deep learning model. Specifically, it masks out certain elements of the attention scores matrix, namely connections between sequence positions that the model should not be allowed to use.

The code first creates two matrices, attnMaskMul and attnMaskAdd, both of size seqLen by seqLen and with data type FLOAT32. Together they form a triangular mask: the values above the diagonal are set to 0.0 for attnMaskMul and negative infinity for attnMaskAdd, while the values on and below the diagonal are set to 1.0 for attnMaskMul and 0.0 for attnMaskAdd.
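As a concrete illustration, here is a minimal plain-Kotlin sketch (not the actual SciCore API) of what the two masks presumably contain, assuming triangle fills the strict upper triangle with its first value argument and everything on or below the diagonal with the second, as inferred from the arguments above:

```kotlin
// Plain-Kotlin stand-in for what sciCore.triangle presumably produces:
// the strict upper triangle gets `upper`, everything on or below the diagonal gets `lower`.
fun triangle(n: Int, upper: Double, lower: Double): Array<DoubleArray> =
    Array(n) { row -> DoubleArray(n) { col -> if (col > row) upper else lower } }

fun main() {
    val seqLen = 4
    val attnMaskMul = triangle(seqLen, 0.0, 1.0)                       // keep-mask: 1.0 = attend
    val attnMaskAdd = triangle(seqLen, Double.NEGATIVE_INFINITY, 0.0)  // -inf on future positions

    attnMaskMul.forEach { row -> println(row.joinToString(" ")) }
    // 1.0 0.0 0.0 0.0
    // 1.0 1.0 0.0 0.0
    // 1.0 1.0 1.0 0.0
    // 1.0 1.0 1.0 1.0
}
```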

When the attention scores matrix att is multiplied element-wise with attnMaskMul, the scores of invalid connections are zeroed out. Adding attnMaskAdd then leaves the scores of valid connections unchanged (att * 1.0 + 0.0 = att) while driving the scores of invalid connections to negative infinity (att * 0.0 + (-inf) = -inf). Since exp(-inf) = 0, the softmax operation that follows assigns these positions exactly zero probability, so the attention distribution allocates no weight to invalid connections.
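To see the effect end to end, here is a hypothetical check in plain Kotlin (again, not the SciCore API) that masks one row of raw scores and runs a softmax over it:

```kotlin
import kotlin.math.exp

// Hypothetical end-to-end check: mask one row of raw scores, then softmax it.
fun main() {
    val att = doubleArrayOf(0.3, 0.9, 0.5, 0.1)   // raw scores for query position 1 (0-based)
    val mul = doubleArrayOf(1.0, 1.0, 0.0, 0.0)   // row 1 of attnMaskMul: keep positions 0..1
    val add = doubleArrayOf(0.0, 0.0, Double.NEGATIVE_INFINITY, Double.NEGATIVE_INFINITY)

    // att * attnMaskMul + attnMaskAdd, element-wise
    val masked = DoubleArray(att.size) { i -> att[i] * mul[i] + add[i] }

    // Softmax: exp(-inf) = 0, so masked positions get exactly zero weight.
    val exps = masked.map { exp(it) }
    val sum = exps.sum()
    println(exps.map { it / sum })  // ~[0.354, 0.645, 0.0, 0.0]
}
```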

In summary, this code creates two triangular matrices to mask out invalid attention connections in a deep learning model's attention mechanism, and applies them to the attention scores matrix att.
When would these invalid connections form, and why do we need to mask them? Is that a common occurrence?
The invalid connections are not an occasional glitch; they arise from the attention operation itself, which computes a score between every pair of positions in the sequence, including pairs where a token would attend to tokens that come after it. To understand why these must be removed, let's take a look at how and why triangle masks are used.

Let's assume the attention model above is an auto-regressive Transformer, which means that the prediction of the next token depends only on the tokens that came before it.

So in order to predict the next token, you have to make sure that only previous tokens are attended to. (Otherwise training would amount to cheating, because the model could simply look at the token it is supposed to predict.)

So an attention mask would look like this, where a 1 marks a future position to be masked out and a 0 marks an allowed connection:
```block
[0, 1, 1, 1, 1]
[0, 0, 1, 1, 1]
[0, 0, 0, 1, 1]
[0, 0, 0, 0, 1]
[0, 0, 0, 0, 0]
```

For example, if you are translating English to Spanish:
Input: How are you ?
Target: ```< start > Como estas ? < end >```
Then the decoder will predict something like this (```< start >``` is given to the decoder as the initial token):
```block
< start > 
< start > Como
< start > Como estas
< start > Como estas ?
< start > Como estas ? < end >
```

Now compare this step-by-step prediction sequence to the attention mask matrix given above: the zeros correspond to the valid connections we want to pay attention to as the sequence is built, while the ones form the triangle of invalid connections we want to remove from attention. So, yes, in this sense, invalid connections are not only common but inherent to autoregressive models.
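As a sanity check, this hypothetical snippet prints which tokens each decoding step may attend to under the 5x5 mask above (0 = attend, 1 = masked); the output matches the step-by-step prediction list:

```kotlin
// Hypothetical sanity check: for each row of the 5x5 mask above (0 = attend, 1 = masked),
// print the prefix of target tokens that step is allowed to see.
fun main() {
    val tokens = listOf("<start>", "Como", "estas", "?", "<end>")
    val mask = Array(5) { row -> IntArray(5) { col -> if (col > row) 1 else 0 } }

    mask.forEachIndexed { row, cols ->
        val visible = tokens.filterIndexed { col, _ -> cols[col] == 0 }
        println("step ${row + 1}: ${visible.joinToString(" ")}")
    }
    // step 1: <start>
    // step 2: <start> Como
    // ...
    // step 5: <start> Como estas ? <end>
}
```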