# **Advanced Transformer Applications**

**Transformers** have evolved beyond NLP, finding application in fields such as **computer vision, speech recognition, and reinforcement learning** due to their flexible self-attention architecture.

- **Vision Transformers (ViTs)** demonstrate that multi-head attention can replace CNNs in image analysis. By dividing an image into *patches*, the model treats them as sequences of data, allowing for a more global and adaptable representation.

- **In speech recognition**, Transformers operate on audio represented as spectrograms or raw waveforms, capturing long-range dependencies in the speech signal. Models such as **Wav2Vec** and **Speech Transformer** combine convolutional and Transformer layers and have reported better transcription accuracy than earlier approaches.

- **In reinforcement learning**, models such as **Decision Transformer** treat decision making as sequence modeling: they predict the next action from past trajectories of states, actions, and returns.

This versatility makes Transformers essential tools for tackling complex problems beyond language processing, opening up new possibilities in areas such as visual perception, audio processing, and autonomous artificial intelligence.
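To make the patch idea behind Vision Transformers more concrete, here is a minimal NumPy sketch of how an image can be turned into a sequence of flattened patches. The function name `image_to_patches`, the toy 32 × 32 image, and the patch size of 8 are assumptions chosen for illustration. A real ViT would also apply a learned projection and add position information.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    grid = image.reshape(rows, patch_size, cols, patch_size, c).transpose(0, 2, 1, 3, 4)
    # One row per patch, ordered left to right and top to bottom
    return grid.reshape(rows * cols, patch_size * patch_size * c)

image = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)  # toy image
patches = image_to_patches(image, patch_size=8)
print(patches.shape)  # (16, 192)
```

Each of the 16 rows can then be treated like one token of a sequence.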

### **Abstract: Transformers for Time Series Forecasting**

**Transformers** are increasingly used for **time series** forecasting and can offer advantages over traditional models such as **ARIMA**, **RNNs**, and **LSTMs**.

### **Why use Transformers for Time Series?**
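- **Capture long-term dependencies**: every time step can attend directly to every other one, unlike the step-by-step recurrence of an RNN/LSTM.
- **Process time steps in parallel**, which accelerates training.
- **Handle variable-length sequences** and missing data flexibly, for example through attention masks.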
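The first two points follow from how self-attention is computed: one matrix product scores every time step against every other. The NumPy sketch below shows single-head scaled dot-product self-attention. The function name, the random weights, and the toy sizes are assumptions for illustration, not the Keras layer used later in this notebook.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a (T, d_model) sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (T, T): every step scored against every step
    scores -= scores.max(axis=-1, keepdims=True)     # stabilize the softmax numerically
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
T, d_model, d_head = 6, 8, 4                         # toy sizes, chosen for illustration
x = rng.normal(size=(T, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)  # (6, 4) (6, 6)
```

Row `i` of `weights` shows how much step `i` draws on each other step, regardless of how far apart they are.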

### **Model Architecture**
A Transformer model for time series follows this structure:
1. **Embedding Layer** → Projects each time step of the sequence into a dense vector.
2. **Transformer Blocks** → Layers with **self-attention** and **feed-forward networks** to analyze relationships between data over time.
3. **Final Dense Layer** → Predicts the next value in the series.
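The structure above does not say how the order of time steps reaches the attention layers. Self-attention on its own does not encode position, so Transformer models commonly add a positional encoding to the embeddings. As a hedged illustration, here is the sinusoidal encoding from the original Transformer design in NumPy. The function name is hypothetical, and this encoding is an optional extension, not a step in the model built below.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings (d_model even)."""
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # even feature indices
    angles = positions / np.power(10000.0, dims / d_model)  # (seq_len, d_model / 2)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)                      # even columns
    encoding[:, 1::2] = np.cos(angles)                      # odd columns
    return encoding

pe = sinusoidal_positional_encoding(seq_len=60, d_model=128)
print(pe.shape)  # (60, 128)
```

The resulting matrix has the same shape as the embedded sequence, so it could be added to it element-wise before the first Transformer block.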



In [44]:
import tensorflow as tf
from tensorflow.keras.layers import Layer, Dense, LayerNormalization, Dropout, MultiHeadAttention, Input, Embedding, Flatten, LSTM
from tensorflow.keras.models import Sequential, Model

In [32]:
# Create Transformer Block
class TransformerBlock(Layer):
    """Encoder block: self-attention and feed-forward sublayers, each followed by a residual connection and LayerNormalization."""

    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1, **kwargs):
        super().__init__(**kwargs)
        # key_dim is the size of each attention head; with key_dim=embed_dim every head is full width
        self.att = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        # Position-wise feed-forward network: expand to ff_dim, then project back to embed_dim
        self.ffn = Sequential([
            Dense(ff_dim, activation="relu"),
            Dense(embed_dim)
        ])
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(rate)
        self.dropout2 = Dropout(rate)

    def call(self, inputs, training=None, mask=None):
        # Self-attention: query, value, and key are all the input sequence
        attn_output = self.att(inputs, inputs, inputs, attention_mask=mask)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)  # residual connection, then normalization
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)  # second residual connection, then normalization

### **Data Preparation**
- A **stock price** dataset (`stock_prices.csv`) is used; only the `Close` column is kept.
- The data is **normalized** to the range [0, 1] with **MinMaxScaler**.
- **Input sequences** are built from the previous 60 values, and each one is labeled with the value that follows it.
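The windowing step in the last bullet is easiest to see on a tiny series. The sketch below uses a hypothetical helper `make_windows` and made-up numbers. It pairs every window with the value right after it, so it illustrates the idea and is not a copy of the notebook's `create_dataset`.

```python
import numpy as np

def make_windows(series, time_step):
    """Pair each window of `time_step` consecutive values with the value that follows it."""
    series = np.asarray(series, dtype=float)
    n_samples = len(series) - time_step
    X = np.stack([series[i:i + time_step] for i in range(n_samples)])
    y = series[time_step:]
    return X, y

X_toy, y_toy = make_windows([10, 11, 12, 13, 14, 15], time_step=3)
print(X_toy)  # three windows: 10-12, 11-13, 12-14
print(y_toy)  # [13. 14. 15.]
```

With a window of 3, a series of 6 values yields 3 training pairs. With the notebook's window of 60, each input row holds 60 consecutive closing prices.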

In [33]:
%pip install pandas scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [34]:
import numpy as np
import pandas as pd 
from sklearn.preprocessing import MinMaxScaler
from IPython.display import display

In [35]:
# Load the dataset
data = pd.read_csv("stock_prices.csv")
data = data[["Close"]].values

display(data)

array([[100.99342831],
       [ 99.77349641],
       [101.3954271 ],
       ...,
       [198.13620067],
       [199.62384106],
       [198.51019471]])

In [36]:
# Normalize the Data
scaler = MinMaxScaler(feature_range=(0, 1))
data = scaler.fit_transform(data)

In [37]:
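# Prepare the data for training
def create_dataset(data, time_step=1):
    """Build (window, next value) pairs from a 2-D array of shape (n, 1)."""
    X, Y = [], []
    # Note: the extra "- 1" stops one window early, so the last value is never used as a label
    for i in range(len(data) - time_step - 1):
        a = data[i:(i + time_step), 0]  # window of time_step consecutive values
        X.append(a)
        Y.append(data[i + time_step, 0])  # the value right after the window
    return np.array(X), np.array(Y)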

time_step = 60
X, Y = create_dataset(data, time_step)

In [38]:
print(f"Length of Data {len(data)}")
print(f"Length of X {len(X)}")
print("Shape of first Element in X:", X[0].shape if len(X) > 0 else "X is empty")

if len(X) > 0:
    # Add a feature axis: (samples, time steps) -> (samples, time steps, 1)
    X = X.reshape(X.shape[0], X.shape[1], 1)
    print(f"Shape of X after reshape {X.shape}")

print(f"Shape of X {X.shape}")
print(f"Shape of Y {Y.shape}")

Length of Data 2000
Length of X 1939
Shape of first Element in X: (60,)
Shape of X after reshape (1939, 60, 1)
Shape of X (1939, 60, 1)
Shape of Y (1939,)


In [41]:
# Define the Transformer Model
input_shape = (X.shape[1], X.shape[2])
inputs = Input(shape=input_shape)

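# Input projection (plays the role of the embedding): each scalar time step becomes a 128-dimensional vector
x = Dense(128)(inputs)

# Stack of four Transformer blocks
# Note: training=True is hardcoded here; depending on the Keras version this can keep
# dropout active at inference, so leaving the flag for Keras to set is generally safer.
for _ in range(4):
    x = TransformerBlock(embed_dim=128, num_heads=4, ff_dim=512)(x, training=True)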

# Output Layer
x = Flatten()(x)
outputs = Dense(1)(x)

# Create the model
model = Model(inputs, outputs)

model.summary()

The Keras `model.summary()` output describes the **Transformer model architecture** for time series forecasting. The sections below go through its layers and parameter counts.

---

### **1. General model structure**
- The model accepts inputs with shape **(None, 60, 1)**: a variable batch size (`None`), 60 time steps, and 1 feature.
- After four Transformer blocks, the sequence is flattened and reduced to a single value (the prediction of the next value in the time series).

---

### **2. Layers detail**

| **Layer (type)** | **Output Shape** | **Parameters (#)** | **Description** |
|------------------|------------------|--------------------|-----------------|
| **input_layer_8 (InputLayer)** | (None, 60, 1) | 0 | Input layer with temporal sequences of length 60 and 1 feature. |
| **dense_18 (Dense)** | (None, 60, 128) | 256 | Projects the input into a 128-dimensional space (embedding). |
| **transformer_block_7 (TransformerBlock)** | (None, 60, 128) | 396,032 | First Transformer block with self-attention. |
| **transformer_block_8 (TransformerBlock)** | (None, 60, 128) | 396,032 | Second Transformer block, continues processing the sequence. |
| **transformer_block_9 (TransformerBlock)** | (None, 60, 128) | 396,032 | Third Transformer block, allows the model to learn deeper relationships. |
| **transformer_block_10 (TransformerBlock)** | (None, 60, 128) | 396,032 | Fourth and final Transformer block, captures long-term dependencies. |
| **flatten (Flatten)** | (None, 7680) | 0 | Transforms the sequence into a one-dimensional vector for the output layer. |
| **dense_27 (Dense)** | (None, 1) | 7,681 | Final layer that generates the time series forecast. |

---

### **3. Number of parameters analysis**
- **Dense (input projection)**: **256 parameters**, that is 1 × 128 weights plus 128 biases (input_dim=1, output_dim=128).
- **Each Transformer Block**: **396,032 parameters**, split across:
  - **Self-Attention** (4 heads with `key_dim=128`): 263,808 parameters,
  - **Feed Forward Network** (128 → 512 → 128): 131,712 parameters,
  - **Two LayerNormalization layers**: 512 parameters (Dropout has none).
- **Flatten**: no parameters; it only reshapes the 60 × 128 sequence into 7,680 values.
- **Last Dense**: **7,681 parameters**, that is 7,680 weights plus 1 bias.
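The counts in the summary can be reproduced with plain arithmetic. The sketch below assumes the hyperparameters used in this notebook (`embed_dim=128`, `num_heads=4`, `key_dim=128`, `ff_dim=512`, 60 time steps) and the usual layout of a Keras `MultiHeadAttention` layer: biased query, key, value, and output projections. The helper `dense_params` is hypothetical.

```python
def dense_params(n_in, n_out):
    """Parameters of a fully connected layer: weights plus biases."""
    return n_in * n_out + n_out

embed_dim, num_heads, key_dim, ff_dim, seq_len = 128, 4, 128, 512, 60

proj = dense_params(1, embed_dim)                      # input projection
# Query, key, and value projections each map embed_dim -> num_heads * key_dim
qkv = 3 * dense_params(embed_dim, num_heads * key_dim)
attn_out = dense_params(num_heads * key_dim, embed_dim)
ffn = dense_params(embed_dim, ff_dim) + dense_params(ff_dim, embed_dim)
layer_norms = 2 * (2 * embed_dim)                      # gamma and beta for each of the two layers
block = qkv + attn_out + ffn + layer_norms
head = dense_params(seq_len * embed_dim, 1)            # Dense(1) applied to the flattened sequence
total = proj + 4 * block + head
print(proj, block, head, total)  # 256 396032 7681 1592065
```

Most of each block's parameters sit in the attention projections, because `key_dim=128` makes each of the 4 heads as wide as the full embedding.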

---




### **Model Training**
- **Optimizer:** Adam
- **Loss Function:** Mean Squared Error (MSE)
- The model is trained on the prepared windows for 20 epochs with a batch size of 16.
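For reference, the MSE loss over $N$ training windows, with true next values $y_i$ and predictions $\hat{y}_i$ (symbols introduced here for illustration), is:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$

Because the targets were scaled to $[0, 1]$ with MinMaxScaler, this loss is in scaled units. Assuming $x_{\min}$ and $x_{\max}$ denote the minimum and maximum the scaler was fitted on, a root-mean-square error in price units can be recovered as:

$$\mathrm{RMSE}_{\text{price}} = (x_{\max} - x_{\min})\,\sqrt{\mathrm{MSE}}$$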

In [42]:
# Compile the Model
model.compile(optimizer="Adam", loss="mse")

In [43]:
# Train the Model
model.fit(X, Y, epochs=20, batch_size=16)

Epoch 1/20
[1m122/122[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m36s[0m 109ms/step - loss: 7.3802
Epoch 2/20
[1m122/122[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m14s[0m 112ms/step - loss: 0.2023
Epoch 3/20
[1m122/122[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m13s[0m 109ms/step - loss: 0.1404
Epoch 4/20
[1m122/122[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m13s[0m 109ms/step - loss: 0.2066
Epoch 5/20
[1m122/122[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m13s[0m 108ms/step - loss: 0.0734
Epoch 6/20
[1m122/122[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m13s[0m 109ms/step - loss: 0.0882
Epoch 7/20
[1m122/122[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m13s[0m 108ms/step - loss: 0.0654
Epoch 8/20
[1m122/122[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m13s[0m 110ms/step - loss: 0.0342
Epoch 9/20
[1m122/122[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m14s[0m 111ms/step - loss: 0.0265
Epoch 10/20
[1m122/122[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[

<keras.src.callbacks.history.History at 0x1c6bb9b6260>


### **Evaluation and Forecasting**
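- An **LSTM** model is trained on the same windows as a baseline. Because it reuses the name `model`, the predictions below come from the LSTM rather than the Transformer.
- Predictions are made on the same sequences used for training, so the plot shows in-sample fit rather than performance on unseen data.
- The predicted values are **denormalized** to bring them back to the original scale.
- The results are plotted against the real data for comparison.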
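To measure performance on data the model has not seen, one common approach is a chronological split with scaling statistics taken from the training part only. The sketch below is a minimal NumPy version under stated assumptions: the helper names, the 80/20 split, and the toy `prices` array are hypothetical, and the min-max functions mirror the default formula of `MinMaxScaler` rather than calling it.

```python
import numpy as np

def chronological_split(values, train_fraction=0.8):
    """Split a 1-D series into an earlier training part and a later test part."""
    split = int(len(values) * train_fraction)
    return values[:split], values[split:]

def min_max_fit(train):
    """Return the minimum and maximum of the training part."""
    return float(train.min()), float(train.max())

def min_max_transform(values, lo, hi):
    return (values - lo) / (hi - lo)

def min_max_inverse(scaled, lo, hi):
    return scaled * (hi - lo) + lo

prices = np.linspace(100.0, 200.0, 50)           # toy stand-in for the Close column
train, test = chronological_split(prices)
lo, hi = min_max_fit(train)                      # statistics come from the training part only
test_scaled = min_max_transform(test, lo, hi)    # can exceed 1 when test prices are higher
restored = min_max_inverse(test_scaled, lo, hi)
print(len(train), len(test), np.allclose(restored, test))  # 40 10 True
```

Windows would then be built separately from each part, and predictions on the test windows compared with the real values after inverse scaling.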

In [47]:
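# Define an LSTM baseline (note: this rebinds `model`, replacing the Transformer defined earlier)
model = Sequential()
model.add(Input(shape=(time_step, 1)))  # declare the input shape with an Input layer
# return_sequences=True: the first LSTM returns its output at every time step, because the second LSTM expects a sequence.
model.add(LSTM(50, return_sequences=True))
model.add(LSTM(50, return_sequences=False))  # returns only the output at the last time step
model.add(Dense(1))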



In [48]:
# Compile the model
model.compile(optimizer="adam", loss="mean_squared_error")

In [49]:
# Train the model
model.fit(X, Y, epochs=10, batch_size=32)

Epoch 1/10
[1m61/61[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m8s[0m 77ms/step - loss: 0.0842
Epoch 2/10
[1m61/61[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m22s[0m 346ms/step - loss: 4.3960e-04
Epoch 3/10
[1m61/61[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 99ms/step - loss: 3.5800e-04
Epoch 4/10
[1m61/61[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 88ms/step - loss: 3.5796e-04
Epoch 5/10
[1m61/61[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 68ms/step - loss: 3.9438e-04
Epoch 6/10
[1m61/61[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 60ms/step - loss: 3.8814e-04
Epoch 7/10
[1m61/61[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 54ms/step - loss: 3.9883e-04
Epoch 8/10
[1m61/61[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 49ms/step - loss: 3.8917e-04
Epoch 9/10
[1m61/61[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 45ms/step - loss: 3.8923e-04
Epoch 10/10
[1m61/61[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m 

<keras.src.callbacks.history.History at 0x1c6c1fddc00>

In [None]:
# Make predictions
predictions = model.predict(X)

# Inverse transform the predictions to get the original scale
predictions = scaler.inverse_transform(predictions)

import matplotlib.pyplot as plt

plt.plot(scaler.inverse_transform(data), label="True Data")
plt.plot(np.arange(time_step, time_step + len(predictions)), predictions, label="Predictions")

plt.xlabel("Time")
plt.ylabel("Stock Prices")
plt.legend()
plt.show()