## Supplementary Material
Deep Learning in EEG-Based BCIs: A Comprehensive Review of Transformer Models, Advantages, Challenges, and Applications


### EEGTransformer Class

The `EEGTransformer` class is designed to leverage a transformer-based architecture tailored specifically for Electroencephalogram (EEG) data processing.

#### Parameters:
- `num_channels` (int): Specifies the number of channels in the EEG dataset.
- `num_timepoints` (int): Indicates the number of time points or the sequence length in the EEG data.
- `output_dim` (int): Defines the output dimensionality for the classifier layer.
- `hidden_dim` (int): Specifies the hidden layer dimensionality.
- `num_heads` (int): Determines the number of attention heads to be used in the multi-head self-attention mechanism.
- `key_query_dim` (int): Denotes the dimensionality for the key/query pairs in the self-attention mechanism.
- `hidden_ffn_dim` (int): Indicates the hidden layer dimensionality for the feed-forward network.
- `intermediate_dim` (int): Refers to the dimensionality of the intermediate layer in the feed-forward network.
- `ffn_output_dim` (int): Specifies the output size of the feed-forward network.

#### Attributes:
- `positional_encoding` (torch.Tensor): A tensor of shape `(num_channels, num_timepoints)` that imparts the sequence position information.
- `multihead_attn` (nn.MultiheadAttention): Implements the multi-head self-attention mechanism.
- `ffn` (nn.Sequential): Constructs a feed-forward network composed of a linear transformation followed by ReLU activation and another linear transformation.
- `norm1` and `norm2` (nn.LayerNorm): Execute layer normalization.
- `classifier` (nn.Linear): Deploys a final linear transformation layer to categorize the input into designated classes.

#### Methods:
- `forward(X)`: Outlines the forward propagation for the model.
  - `X` (torch.Tensor): The input tensor for EEG data, which should have a shape of `(batch_size, num_channels, num_timepoints)`.

  - **Steps**:
    1. Standardize the input tensor.
    2. Apply positional encoding.
    3. Implement multi-head self-attention.
    4. Reshape the attention output and apply layer normalization.
    5. Forward the data through the feed-forward network.
    6. Flatten the resultant tensor and direct it through a classifier layer.
    7. Yield the final output.
  
### Notes:

- The model applies layer normalization after the multi-head self-attention and feed-forward network stages.
- Positional encoding gives the model information about the position of each element in the sequence; such encodings can be either absolute or relative.
- The classifier layer flattens the model output and categorizes it into `output_dim` classes.

### Usage:

To employ the `EEGTransformer` model, instantiate the class using the desired parameters. Then, similar to any other PyTorch model, forward the input data to the model and utilize the returned output for either training or inference.

```python
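# Sample usage (assumes the PyTorch EEGTransformer class described above is defined)
import torch

model = EEGTransformer(num_channels=32, num_timepoints=200, output_dim=2,
                       hidden_dim=512, num_heads=8, key_query_dim=512,
                       hidden_ffn_dim=512, intermediate_dim=2048,
                       ffn_output_dim=32)

# A batch of 64 synthetic trials with shape (batch_size, num_channels, num_timepoints)
input_data = torch.randn(64, 32, 200)
output = model(input_data)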
```

Ensure that the model is paired with a compatible loss function and optimizer for effective training. Depending on the specifics of the EEG dataset or application requirements, the model can be further refined.

In [16]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers


class EEGTransformer(keras.Model):
    """
    TensorFlow Keras model for EEG data using a Transformer architecture.
    This model is designed to process EEG data for tasks such as classification or regression.
    """
    def __init__(self, num_channels, num_timepoints, output_dim,
                 num_heads, key_dim, ffn_intermediate_dim, dropout_rate=0.1,
                 name='EEGTransformer'):
        # Call the parent constructor
        super().__init__(name=name)

        # --- Store key parameters ---
        # Number of channels in the EEG data (also the embedding dimension)
        self.num_channels = num_channels
        # Number of time points in the sequence
        self.num_timepoints = num_timepoints
        # The number of output classes for the final classifier
        self.output_dim = output_dim

        # --- Positional Encoding ---
        # Create the positional encoding matrix using TensorFlow operations
        # This is a non-trainable part of the model
        self.positional_encoding = self.build_positional_encoding()

        # --- Transformer Encoder Block ---
        # This block contains the core logic: Multi-Head Attention and Feed-Forward Network

        # 1. Multi-Head Self-Attention Layer
        # This layer learns the relationships between different time points
        self.multihead_attn = layers.MultiHeadAttention(
            num_heads=num_heads,
            key_dim=key_dim,
            name='multihead_attention'
        )

        # 2. Layer Normalization for the attention block
        # Stabilizes the output of the attention layer
        self.norm1 = layers.LayerNormalization(epsilon=1e-6, name="layer_norm_1")

        # 3. Dropout for the attention block
        # A regularization technique to prevent overfitting
        self.dropout1 = layers.Dropout(dropout_rate)

        # 4. Position-wise Feed-Forward Network (FFN)
        # A simple two-layer MLP applied to each time point independently
        self.ffn = keras.Sequential(
            [
                layers.Dense(ffn_intermediate_dim, activation="relu"),
                layers.Dense(num_channels),  # Project back to the original embedding dimension
            ],
            name="feed_forward_network"
        )

        # 5. Layer Normalization for the FFN block
        self.norm2 = layers.LayerNormalization(epsilon=1e-6, name="layer_norm_2")

        # 6. Dropout for the FFN block
        self.dropout2 = layers.Dropout(dropout_rate)

        # --- Final Classifier ---
        # This part takes the processed sequence and makes a final prediction

        # Flattens the output of the transformer block into a single vector per trial
        self.flatten = layers.Flatten()
        # A dense layer to classify the flattened features into the output classes
        self.classifier = layers.Dense(output_dim, name="classifier")

    def build_positional_encoding(self):
        """
        Creates the sinusoidal positional encoding matrix.
        This is a direct TensorFlow implementation of the formula from the paper.
        Returns a tensor of shape (1, num_timepoints, num_channels).
        """
        # Create an array of positions (0, 1, 2, ..., num_timepoints-1)
        positions = tf.range(start=0, limit=self.num_timepoints, delta=1, dtype=tf.float32)
        # Create an array of dimensions/channels (0, 1, 2, ..., num_channels-1)
        channels = tf.range(start=0, limit=self.num_channels, delta=1, dtype=tf.float32)

        # Calculate the angle rates for the sine/cosine functions
        # This is the 1 / (10000^(j/c)) part of the formula; the exponent uses
        # j for even channel indices and j-1 for odd channel indices
        exponents = (2.0 * tf.floor(channels / 2.0)) / tf.cast(self.num_channels, tf.float32)
        angle_rates = 1.0 / tf.pow(10000.0, exponents)

        # Create the angle matrix by multiplying positions and rates
        angle_rads = positions[:, tf.newaxis] * angle_rates[tf.newaxis, :]  # (timepoints, channels)

        # Apply sin to even channel indices and cos to odd channel indices,
        # so that sines and cosines are interleaved along the channel axis
        is_even = tf.equal(tf.math.floormod(channels, 2.0), 0.0)[tf.newaxis, :]
        pos_encoding = tf.where(is_even, tf.sin(angle_rads), tf.cos(angle_rads))
        # Add a batch dimension so it can be added to the input data
        pos_encoding = pos_encoding[tf.newaxis, ...]

        return tf.cast(pos_encoding, dtype=tf.float32)

    def call(self, inputs, training=None):
        """
        Defines the forward pass of the EEGTransformer model.
        """
        # --- 1. Input Standardization ---
        # Standardize each channel of the EEG data to have zero mean and unit variance over time
        mean = tf.reduce_mean(inputs, axis=2, keepdims=True)
        std = tf.math.reduce_std(inputs, axis=2, keepdims=True)
        # Apply Z-score normalization (the small constant avoids division by zero)
        standardized_inputs = (inputs - mean) / (std + 1e-6)

        # --- 2. Add Positional Encoding ---
        # Keras's MultiHeadAttention expects (batch, sequence, features)
        # Our input is (batch, channels, timepoints). We need to transpose it.
        x = tf.transpose(standardized_inputs, perm=[0, 2, 1])  # (batch, timepoints, channels)
        # The encoding has shape (1, timepoints, channels) and broadcasts over the batch
        x = x + self.positional_encoding

        # --- 3. Transformer Encoder Block ---
        # --- Attention Sub-layer ---
        # Pass the data as query, key, and value to the multi-head attention layer
        attn_output = self.multihead_attn(query=x, value=x, key=x, training=training)
        # Add the attention output to the input (residual connection), then normalize
        x = self.norm1(x + self.dropout1(attn_output, training=training))

        # --- Feed-Forward Network Sub-layer ---
        # Pass the output through the feed-forward network
        ffn_output = self.ffn(x)
        # Apply dropout
        ffn_output = self.dropout2(ffn_output, training=training)
        # Apply the second skip-connection and layer normalization
        x = self.norm2(x + ffn_output)

        # --- 4. Final Classifier ---
        # Flatten the output to prepare for classification
        x = self.flatten(x)  # (batch, timepoints * channels)
        # Pass through the classifier to get the final output
        output = self.classifier(x)  # (batch, output_dim)

        return output


# --- Double-check of implemented functionalities ---
# 1. Input Standardization: Implemented in the `call` method.
# 2. Positional Encoding: Implemented in `build_positional_encoding` and added in `call`.
# 3. Multi-Head Attention: Implemented using `layers.MultiHeadAttention`.
# 4. Skip-Connections & Layer Norm: Implemented for both sub-layers (`x + attn_output`, `x + ffn_output`).
# 5. Position-wise FFN: Implemented using `keras.Sequential` with two Dense layers.
# 6. Final Classifier: Implemented using `layers.Flatten` and `layers.Dense`.
# All core functionalities from the PyTorch notebook have been re-implemented.

### 1. SETUP PARAMETERS (Mirrors the PyTorch notebook example)
This section defines the parameters for our synthetic data and model.

In [17]:
# --- Synthetic Data Parameters ---
# Set the number of channels in the synthetic EEG data.
num_channels: int = 32
# Set the number of time points (sequence length) in the synthetic EEG data.
num_timepoints: int = 200
# Set the number of trials (samples) in our synthetic batch.
batch_size: int = 64
# Set the number of output classes for our classification task.
output_dim: int = 2  # L=2 for binary classification

# --- Model Hyperparameters ---
# These values are taken directly from the PyTorch notebook's example instantiation.
# Number of attention heads in the Multi-Head Attention layer.
num_heads: int = 8
# Dimensionality of the key and query vectors in each attention head.
key_dim: int = 512
# The size of the hidden layer within the Feed-Forward Network (FFN).
ffn_intermediate_dim: int = 2048

# --- Training Parameters ---
# The learning rate for the Adam optimizer.
learning_rate: float = 0.001
# The number of epochs to train for.
epochs: int = 10

### 2. GENERATE SYNTHETIC DATA
This section creates random data to test the model, using TensorFlow.

In [18]:
import numpy as np

print("--- Generating synthetic EEG data... ---")

# Generate the input data 'X' with the shape (batch_size, num_channels, num_timepoints).
# tf.random.normal is the TensorFlow equivalent of torch.randn.
X: tf.Tensor = tf.random.normal((batch_size, num_channels, num_timepoints))

# Generate the integer labels 'y' for binary classification (labels are 0 or 1).
# np.random.randint is a straightforward way to create the labels.
y: np.ndarray = np.random.randint(0, output_dim, (batch_size,))

# Print the shapes to confirm they are correct.
print(f"Input data shape (X): {X.shape}")
print(f"Labels shape (y): {y.shape}")

--- Generating synthetic EEG data... ---
Input data shape (X): (64, 32, 200)
Labels shape (y): (64,)


### 3. INSTANTIATE AND COMPILE THE TENSORFLOW MODEL
This section creates an instance of the TensorFlow `EEGTransformer` and
prepares it for training.


In [19]:
print("\n--- Initializing the TensorFlow EEGTransformer model... ---")

# Instantiate your TensorFlow EEGTransformer class with the defined parameters.
model: keras.Model = EEGTransformer(
    num_channels=num_channels,
    num_timepoints=num_timepoints,
    output_dim=output_dim,
    num_heads=num_heads,
    key_dim=key_dim,
    ffn_intermediate_dim=ffn_intermediate_dim
)

# Define the optimizer. tf.keras.optimizers.Adam is the equivalent of torch.optim.Adam.
optimizer: keras.optimizers.Optimizer = keras.optimizers.Adam(learning_rate=learning_rate)

# Define the loss function.
# SparseCategoricalCrossentropy is used because our labels 'y' are integers (0, 1), not one-hot encoded.
# from_logits=True is crucial because our model's final layer outputs raw scores (logits), not probabilities.
loss_fn: keras.losses.Loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Compile the model. This configures the model with the optimizer, loss function,
# and any metrics we want to track during training.
model.compile(optimizer=optimizer, loss=loss_fn, metrics=['accuracy'])

# Build the model from the expected input shape. This is necessary to
# print the summary.
model.build(input_shape=(None, num_channels, num_timepoints))

# Print a summary of the model's architecture and number of parameters.
model.summary()


--- Initializing the TensorFlow EEGTransformer model... ---



In [None]:
# num_channels=32
# num_timepoints=200
# batch_size = 64

# X = torch.randn(batch_size, num_channels, num_timepoints)
# y = torch.randint(0, 2, (batch_size,))  # L=2 for binary classification

# # Model, Loss and Optimizer
# model = EEGTransformer(num_channels, num_timepoints, output_dim=2,
#                        hidden_dim=512, num_heads=8, key_query_dim=512,
#                        hidden_ffn_dim=512, intermediate_dim=2048,
#                        ffn_output_dim=num_channels)

# criterion = nn.CrossEntropyLoss()
# optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# # Training loop
# epochs = 10

# for epoch in range(epochs):
#     optimizer.zero_grad()
#     outputs = model(X)
#     loss = criterion(outputs, y)
#     loss.backward()
#     optimizer.step()
#     print(f"Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}")


# # once the model is trained, it can be tested on unseen EEG test examples
# # also, different model selection techniques (e.g. cross-validation methods) can be implemented within the training loop


### Transformer's Architecture for EEG Classification

This section presents the "standard" approach for utilizing the Transformer encoder to classify EEG patterns for BCIs.

#### Input Standardization and Positional Encoding

Let the set of pairs $ D_{\text{train}} = \{(\mathbf{X}_1,{y}_1),\dots, (\mathbf{X}_n,{y}_n)\} $ denote $ n $ trials of EEG recordings, where $ {y}_i $ is the scalar class variable with $ L $ possible labels (e.g., target and non-target in a binary classification) and $ \mathbf{X}_i\in \mathbb{R}^{c\times p} $ is the collection of EEG observations in the $ i^{\text{th}} $ trial over $ c $ channels and $ p $ time points; that is to say,

$$ \mathbf{X}_i=[\mathbf{x}_{i1},\mathbf{x}_{i2},\ldots,\mathbf{x}_{ic}]^T, i=1,\ldots,n\,, $$
 
with $ \mathbf{x}_{ij}=[x_{ij1}, \ldots, x_{ijp}]^T \in \mathbb{R}^{p\times 1}, j=1,\ldots,c $, where $ x_{ijk}, k=1, \ldots, p $ denotes the $ k^{\text{th}} $ element of vector $ \mathbf{x}_{ij} $, and $ T $ denotes the transpose operator. The goal is to use $ D_{\text{train}} $ and train a classifier $ \psi: \mathbb{R}^{c\times p} \rightarrow \{0, 1, \ldots, L-1\} $ that maps a given $ \mathbf{X} $ to a possible value of the class variable.

It is common to apply standardization for each channel to make the sensory data across all channels comparable. 
In this regard, each $ \mathbf{X}_i $ is converted to $ \hat{\mathbf{X}}_i $ where 

$$ \hat{\mathbf{X}}_i=[\hat{\mathbf{x}}_{i1},\hat{\mathbf{x}}_{i2},\ldots,\hat{\mathbf{x}}_{ic}]^T, i=1,\ldots,n\,, $$

and where $ \hat{\mathbf{x}}_{ij} = [\hat{x}_{ij1}, \ldots, \hat{x}_{ijp}]^T $ such that 

$$ \hat{x}_{ijk} = \frac{{x}_{ijk}-m_{ij}}{s_{ij}}\,, $$

with $ m_{ij} $ and $ s_{ij} $ being the sample mean and sample standard deviation of vector $\mathbf{x}_{ij} $ given by

$$ m_{ij} = \frac{1}{p} \sum_{k=1}^p {x}_{ijk}\,, $$
$$ s_{ij} = \sqrt{\frac{1}{p} \sum_{k=1}^p ({x}_{ijk}-m_{ij})^2}\,, $$

respectively.
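
The per-channel standardization above can be sketched in NumPy as follows. This is a minimal illustration on synthetic data, not the notebook's implementation; the function name `standardize_trial` and the optional `eps` guard against a zero standard deviation are assumptions added here (with `eps=0.0` the function matches the formula exactly).

```python
import numpy as np

def standardize_trial(X, eps=0.0):
    """Z-score each channel (row) of one trial X with shape (c, p).

    eps is a hypothetical safeguard against constant channels; it is not
    part of the formula in the text.
    """
    m = X.mean(axis=1, keepdims=True)  # m_ij: sample mean of each channel
    s = X.std(axis=1, keepdims=True)   # s_ij: sample std with 1/p normalization
    return (X - m) / (s + eps)

# Synthetic trial with c = 4 channels and p = 100 time points
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(4, 100))
X_hat = standardize_trial(X)
```

After this step, every row of `X_hat` has zero mean and unit standard deviation, which is what makes the channels comparable.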

For the Transformer to make use of the temporal order of the EEG recordings, it is common to encode some information about the position of sequence elements in its input \cite{vaswani_attention_2017}. This positional encoding is generally realized by adding to each $ \hat{\mathbf{X}}_i $ a matrix $ \mathbf{P} \in \mathbb{R}^{c\times p} $ that is defined based on trigonometric functions with different frequencies across channels \cite{vaswani_attention_2017}. As a result, we obtain

$$ \tilde{\mathbf{X}}_i = \hat{\mathbf{X}}_i + \mathbf{P}, \,i=1,\ldots, n, $$

where the element on row (channel) $ j=1,\ldots, c $, and column (time index) $ k=1, \ldots, p $, of $ \mathbf{P} $, denoted $ p_{jk} $, is given by

$$ p_{jk} = \begin{cases}
\sin\Big(k/10000^{j/c} \Big), & \text{for even } j \\
\cos\Big(k/10000^{(j-1)/c} \Big), & \text{for odd } j
\end{cases} $$
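
As a rough sketch, the matrix $ \mathbf{P} $ can be built in NumPy as shown below. The sketch uses the 1-based indices $ j $ and $ k $ of the text and reads the exponent for odd rows as $ (j-1)/c $, which is an assumption about the intended formula; the function name `positional_encoding` is also introduced here for illustration only.

```python
import numpy as np

def positional_encoding(c, p):
    """Build P with shape (c, p) using 1-based channel index j and time index k."""
    j = np.arange(1, c + 1)[:, None]         # channel indices, shape (c, 1)
    k = np.arange(1, p + 1)[None, :]         # time indices, shape (1, p)
    even = (j % 2 == 0)
    exponent = np.where(even, j, j - 1) / c  # j/c for even j, (j-1)/c for odd j (assumed)
    angle = k / (10000.0 ** exponent)        # shape (c, p)
    return np.where(even, np.sin(angle), np.cos(angle))

P = positional_encoding(c=4, p=6)
```

Adding `P` to a standardized trial of shape `(c, p)` then yields $ \tilde{\mathbf{X}}_i $.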



#### Self-Attention Mechanisms: Capturing Contexts for EEG Classification

Capturing contexts is the essential concept that makes the attention mechanism a promising operation for EEG classification. A context is simply another representation of an element of the input sequence (here, one column of each $ \tilde{\mathbf{X}}_i $) based on its compatibility with the other elements of the sequence. The most widely used attention operation for EEG classification is scaled dot-product self-attention, denoted $ \text{SA}^d_{\mathbf{V}, \mathbf{K}, \mathbf{Q}}(\tilde{\mathbf{X}}_i): \mathbb{R}^{c\times p} \rightarrow \mathbb{R}^{d\times p} $, which was initially proposed and used for translation tasks \cite{vaswani_attention_2017}. In particular,

$$ \text{SA}_{\mathbf{V}, \mathbf{K}, \mathbf{Q}}^d(\tilde{\mathbf{X}}_i) = \mathbf{V}\tilde{\mathbf{X}}_i\times\text{softmax}\Big(\frac{\tilde{\mathbf{X}}_i^T\mathbf{K}^T\mathbf{Q}\tilde{\mathbf{X}}_i}{\sqrt{q}}\Big)\,, $$

where $ \mathbf{V} \in \mathbb{R}^{d\times c} $, $ \mathbf{K} \in \mathbb{R}^{q\times c} $, $ \mathbf{Q} \in \mathbb{R}^{q\times c} $ are projection matrices that are learned in the training process, $ q $ is known as attention dimensionality, and  $ d $, which is generally a tuning parameter, denotes the dimensionality of the columns of the output matrix (context vectors). We use superscript $ d $ in $ \text{SA}_{\mathbf{V}, \mathbf{K}, \mathbf{Q}}^d(\tilde{\mathbf{X}}_i) $ to highlight the dimensionality of context vectors.
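
The operation can be sketched in NumPy with random projection matrices standing in for learned ones. The text does not state the axis over which the softmax is taken; the sketch assumes a column-wise softmax, so that each output column is a weighted average of the columns of $ \mathbf{V}\tilde{\mathbf{X}}_i $. The function names are hypothetical.

```python
import numpy as np

def softmax_columns(A):
    """Softmax over each column (axis 0), shifted for numerical stability."""
    E = np.exp(A - A.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def self_attention(X, V, K, Q):
    """Scaled dot-product self-attention mapping (c, p) to (d, p)."""
    q = K.shape[0]                             # attention dimensionality
    scores = (K @ X).T @ (Q @ X) / np.sqrt(q)  # X^T K^T Q X / sqrt(q), shape (p, p)
    return (V @ X) @ softmax_columns(scores)   # context vectors, shape (d, p)

# Random matrices stand in for the learned projections (assumption for the demo)
rng = np.random.default_rng(0)
c, p, d, q = 4, 6, 3, 5
X_tilde = rng.normal(size=(c, p))
V = rng.normal(size=(d, c))
K = rng.normal(size=(q, c))
Q = rng.normal(size=(q, c))
context = self_attention(X_tilde, V, K, Q)     # shape (3, 6)
```

Each of the `p` columns of `context` is a `d`-dimensional context vector for the corresponding time point.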



#### Multi-Head Self-Attention

Rather than a single self-attention operation, it is generally beneficial to apply multiple self-attentions in parallel. Using this operation, we view the compatibility of sequence elements using different learned projections. In this context, it is also common to refer to the output matrix of each self-attention as a head. In particular, the multi-head self-attention, denoted $ \text{MSHA}^{d_h}(\tilde{\mathbf{X}}_i) : \mathbb{R}^{c\times p} \rightarrow \mathbb{R}^{d_h\times p} $, is defined as

$$ \text{MSHA}^{d_h}(\tilde{\mathbf{X}}_i) = \mathbf{W}[\text{SA}^d_{\mathbf{V}_1, \mathbf{K}_1, \mathbf{Q}_1}(\tilde{\mathbf{X}}_i)^T, \ldots, \text{SA}^d_{\mathbf{V}_m, \mathbf{K}_m, \mathbf{Q}_m}(\tilde{\mathbf{X}}_i)^T]^T\,, $$

where $ \mathbf{W}\in \mathbb{R}^{d_h\times md} $ is another learnable projection matrix, $ m $ is the number of self-attention operations that are combined, which is also known as the number of heads, and $ d_h $ is the dimensionality of the columns of the output of the $ \text{MSHA}^{d_h}(\tilde{\mathbf{X}}_i) $ operation.
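
A minimal NumPy sketch of this definition follows: the $ m $ head outputs are stacked vertically into an $ md \times p $ matrix and projected by $ \mathbf{W} $. All weights are random placeholders for learned parameters, the column-wise softmax is an assumption, and the function names are hypothetical.

```python
import numpy as np

def softmax_columns(A):
    """Softmax over each column (axis 0), shifted for numerical stability."""
    E = np.exp(A - A.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def self_attention(X, V, K, Q):
    """Single head: (c, p) -> (d, p)."""
    scores = (K @ X).T @ (Q @ X) / np.sqrt(K.shape[0])
    return (V @ X) @ softmax_columns(scores)

def multi_head_self_attention(X, heads, W):
    """heads is a list of (V, K, Q) tuples; W has shape (d_h, m * d)."""
    stacked = np.vstack([self_attention(X, V, K, Q) for V, K, Q in heads])  # (m*d, p)
    return W @ stacked                                                      # (d_h, p)

rng = np.random.default_rng(0)
c, p, d, q, m = 4, 6, 3, 5, 2
d_h = c  # chosen equal to c so the output can later be added to the input
X_tilde = rng.normal(size=(c, p))
heads = [(rng.normal(size=(d, c)), rng.normal(size=(q, c)), rng.normal(size=(q, c)))
         for _ in range(m)]
W = rng.normal(size=(d_h, m * d))
mhsa_out = multi_head_self_attention(X_tilde, heads, W)  # shape (4, 6)
```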



#### Identity Skip-Connection and Layer Normalization

To ensure the stability and efficacy of the training process, especially with the complex nature of EEG data, the Transformer encoder utilizes identity skip-connections \cite{he_deep_2015} followed by layer normalization \cite{Jimmy_2016}. Here we define these operations. Let $ \text{SKP}\big({\text{LAY}(\mathbf{Y})}\big): \mathbb{R}^{a\times b}\rightarrow \mathbb{R}^{a\times b}$ denote the identity skip-connection around a layer $\text{LAY}(\mathbf{Y}) $ (an operation) that operates on an input $ \mathbf{Y} \in \mathbb{R}^{a\times b} $ to produce an output of the same size as the input. Then

$$ \text{SKP}\big({\text{LAY}(\mathbf{Y})}\big) = \mathbf{Y} + \text{LAY}(\mathbf{Y})\,. $$

That is to say, we simply add the output of $ \text{LAY}(\mathbf{Y}) $ to its input. 
Furthermore, let $\text{LN}(\mathbf{Y}):\mathbb{R}^{a\times b} \rightarrow \mathbb{R}^{a\times b}$ denote the layer normalization applied to an $ a\times b $ matrix $ \mathbf{Y} $, $ a > 1 $, with elements $ y_{jk}, j=1,\ldots,a, k=1,\ldots,b $, where each row records the measurements of one feature (here, a channel). Then, $ \text{LN}(\mathbf{Y}) $ produces $ \mathring{\mathbf{Y}} $, which is a matrix of the same size as $ \mathbf{Y} $ with elements $ \mathring{y}_{jk} $ where

$$ \mathring{y}_{jk} = \frac{{y}_{jk}-m_{k}}{s_{k}}\,, $$

and where

$$ m_{k} = \frac{1}{a} \sum_{j=1}^a {y}_{jk}\,, $$
$$ s_{k} = \sqrt{\frac{1}{a} \sum_{j=1}^a ({y}_{jk}-m_{k})^2}\,. $$

In other words, $ \text{LN}(\mathbf{Y}) $ is a type of standardization in which the sample mean and sample standard deviation are computed for each column of $ \mathbf{Y} $ (in the EEG context, for each time point in the sequence) over all features. One place where these operations are used in the Transformer encoder is to produce $ \mathring{\mathbf{X}}_i $ as follows:

$$ \mathring{\mathbf{X}}_i = \text{LN}\Big(\text{SKP}\big({\text{MSHA}^{c}(\tilde{\mathbf{X}}_i)}\big)\Big)\,; $$
 

that is, the skip-connection is used around the multi-head self-attention, which is then followed by layer normalization. Note that the skip-connection in this expression adds the attention output to $ \tilde{\mathbf{X}}_i \in \mathbb{R}^{c\times p} $, which forces the output dimensionality $ d_h $ of the multi-head self-attention to equal $ c $, the number of channels.
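
The two operations can be sketched in NumPy as below. The sketch follows the formulas in the text literally, so it omits the small epsilon and the learnable scale and shift that library layers such as Keras `LayerNormalization` add. The `toy_layer` is a hypothetical stand-in for the multi-head self-attention with $ d_h = c $.

```python
import numpy as np

def layer_norm(Y):
    """Standardize each column of Y (shape (a, b)) over its a features."""
    m = Y.mean(axis=0, keepdims=True)  # m_k: mean of column k
    s = Y.std(axis=0, keepdims=True)   # s_k: std of column k (1/a normalization)
    return (Y - m) / s

def skip(layer, Y):
    """Identity skip-connection: add the layer output to its input."""
    return Y + layer(Y)

rng = np.random.default_rng(0)
c, p = 4, 6
X_tilde = rng.normal(size=(c, p))
W = rng.normal(size=(c, c))

def toy_layer(Y):
    # Hypothetical shape-preserving layer used only for this demo
    return W @ Y

X_ring = layer_norm(skip(toy_layer, X_tilde))  # shape (4, 6)
```

The addition inside `skip` is only defined when the layer preserves the shape of its input, which is the reason the output dimensionality of the attention must match the number of channels.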



#### Position-wise Feed-Forward Networks

The Transformer encoder utilizes a fully connected feed-forward network that transforms each element of a given sequence individually. Let $ \mathbf{Y} \in \mathbb{R}^{a\times b} $ be the generic matrix defined before. The effect of this position-wise feed-forward network operated on an input $ \mathbf{Y} $, denoted 
$\text{FFN}^s(\mathbf{Y})$, is:

$$ \text{FFN}^s(\mathbf{Y}) = [g(\mathbf{y}_1), \ldots, g(\mathbf{y}_b)]\,, $$
 

where $ \mathbf{y}_k, k=1,\ldots, b $ are columns of $ \mathbf{Y} $ and

$$ g(\mathbf{y}_k) = \mathbf{W}_2\times f(\mathbf{W}_1\mathbf{y}_k + \mathbf{b}_1) + \mathbf{b}_2\,, $$

where $ f(.) $ denotes an element-wise nonlinear activation function (e.g., ReLU), and $\mathbf{W}_1\in \mathbb{R}^{r\times a}$, $\mathbf{W}_2 \in \mathbb{R}^{s\times r}$, $\mathbf{b}_1 \in \mathbb{R}^{r\times 1}$, and $\mathbf{b}_2 \in \mathbb{R}^{s\times 1} $ are learnable matrices and vectors; $ r $ is generally a tuning parameter. 

We use superscript $ s $ in $ \text{FFN}^s(\mathbf{Y}) $ to highlight the dimensionality of the output vectors. In the Transformer encoder, the position-wise feed-forward network is applied to $ \mathring{\mathbf{X}}_i $, the output of the attention sub-layer after its skip-connection and layer normalization. The result is added to its input through the skip-connection, followed by layer normalization, to produce an output $ {\mathbf{O}}_i $. This operation is characterized as follows:

$$ {\mathbf{O}}_i = \text{LN}\Big(\text{SKP}\big({\text{FFN}^{c}(\mathring{\mathbf{X}}_i)}\big)\Big)\,. $$
 

Note that the skip-connection in this expression forces $ s $, the output dimensionality of the feed-forward network, to equal $ c $. The classification can be performed by vectorizing $ {\mathbf{O}}_i $ and using that as the input to a fully connected layer with a softmax activation function. 
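
The feed-forward sub-layer and the final classification step can be sketched in NumPy as follows. All weights are random placeholders for learned parameters, ReLU is assumed for $ f $, and the row-major flattening used to vectorize $ \mathbf{O}_i $ is an arbitrary choice made for this sketch; the names `Wc` and `bc` for the classifier parameters are hypothetical.

```python
import numpy as np

def ffn(Y, W1, b1, W2, b2):
    """Position-wise FFN: applies g to every column of Y; (a, b) -> (s, b)."""
    hidden = np.maximum(0.0, W1 @ Y + b1)  # ReLU activation, shape (r, b)
    return W2 @ hidden + b2                # shape (s, b)

def layer_norm(Y):
    """Column-wise standardization over features, as defined in the text."""
    return (Y - Y.mean(axis=0, keepdims=True)) / Y.std(axis=0, keepdims=True)

def softmax(z):
    """Softmax of a 1-D vector, shifted for numerical stability."""
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
c, p, r, L = 4, 6, 8, 2
X_ring = rng.normal(size=(c, p))  # stand-in for the attention sub-layer output
W1, b1 = rng.normal(size=(r, c)), rng.normal(size=(r, 1))
W2, b2 = rng.normal(size=(c, r)), rng.normal(size=(c, 1))  # s = c for the skip-connection

# O_i = LN(SKP(FFN^c(X_ring)))
O = layer_norm(X_ring + ffn(X_ring, W1, b1, W2, b2))

# Vectorize O and apply a fully connected layer with softmax over L classes
Wc, bc = rng.normal(size=(L, c * p)), rng.normal(size=L)
probs = softmax(Wc @ O.reshape(-1) + bc)
predicted_label = int(np.argmax(probs))
```

Because `ffn` acts on each column separately, applying it to a single column gives the same result as slicing that column from the full output.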

In [1]:
def replace_latex_delimiters(text):
    # Replace \( and \) with $
    text = text.replace(r"\(", "$").replace(r"\)", "$")
    
    # Replace \[ and \] with $$
    text = text.replace(r"\[", "$$").replace(r"\]", "$$")
    
    return text


# Read input text
with open('eq.txt', 'r') as file:
    content = file.read()

# Apply the replacement
modified_content = replace_latex_delimiters(content)

 
with open('output.txt', 'w') as file:
    file.write(modified_content)
