# Model Report: Linear Regression and LSTM with Self-Attention


This report describes two models for predicting time-series data: Linear Regression and an LSTM (Long Short-Term Memory) network with an integrated Self-Attention mechanism. Both are commonly applied in forecasting, financial prediction, and other domains that involve sequential data.

## 1. Linear Regression Model
### 1.1 Overview

Linear regression is a widely used statistical technique that models the relationship between a dependent variable $ y $ and one or more independent variables $ X $ by fitting a linear equation to the observed data. It assumes that the relationship between the variables can be approximated by a straight line (or, with several independent variables, a hyperplane).



### 1.2 Mathematical Formula
The linear regression model can be represented as:

$$
y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \epsilon
$$

Where:
- $ y $ is the dependent variable.
- $ X_1, X_2, \dots, X_n $ are the independent variables (features).
- $ \beta_0 $ is the intercept (bias term).
- $ \beta_1, \dots, \beta_n $ are the coefficients (weights) for each feature.
- $ \epsilon $ is the error term.

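The model is trained by minimizing the **Mean Squared Error (MSE)** over the $ N $ training samples:

$$
MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
$$

Where:
- $ N $ is the number of training samples (distinct from $ n $, the number of features above).
- $ y_i $ is the true value of the dependent variable for sample $ i $.
- $ \hat{y}_i $ is the predicted value of the dependent variable for sample $ i $.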
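
As a small worked illustration of the two formulas above, the following sketch evaluates the linear equation for fixed coefficients and computes the resulting MSE with NumPy. The helper names `predict` and `mse` and the toy numbers are assumptions made for this example only.

```python
import numpy as np

def predict(X, beta0, beta):
    """Evaluate y_hat = beta0 + beta_1 * X_1 + ... + beta_n * X_n for each row of X."""
    return beta0 + X @ beta

def mse(y_true, y_pred):
    """Mean of the squared differences between true and predicted values."""
    return float(np.mean((y_true - y_pred) ** 2))

# Toy data (illustrative only): three samples, two features
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0]])
y = np.array([10.0, 5.0, 8.0])

beta0, beta = 1.0, np.array([2.0, 3.0])
y_hat = predict(X, beta0, beta)   # [9., 5., 10.]
error = mse(y, y_hat)             # (1 + 0 + 4) / 3
```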

### 1.3 Model Implementation
In Python, we implement the Linear Regression model with scikit-learn's `LinearRegression` class:

In [None]:
from sklearn.linear_model import LinearRegression

def Regression_model():
    model = LinearRegression()
    return model
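
One possible way to use this model on a time series is to predict each value from the values just before it. The sketch below is an assumption about usage, not part of the original pipeline: the helper `make_lag_features`, the choice of three lags, and the toy series are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def Regression_model():
    model = LinearRegression()
    return model

def make_lag_features(series, n_lags):
    """Hypothetical helper: row j holds series[j : j + n_lags],
    and its target is the next value, series[j + n_lags]."""
    X = np.array([series[j:j + n_lags] for j in range(len(series) - n_lags)])
    y = series[n_lags:]
    return X, y

# Toy series that is exactly linear in time (illustrative only)
series = 0.5 * np.arange(20) + 3.0
X, y = make_lag_features(series, n_lags=3)

# Keep the last four windows out of training to respect time order
model = Regression_model()
model.fit(X[:-4], y[:-4])
pred = model.predict(X[-4:])
```

Splitting by position rather than at random keeps future values out of the training set, which matters for sequential data.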


### 1.4 Training and Optimization
The model is trained by minimizing the MSE defined in Section 1.2. scikit-learn's `LinearRegression` does this by ordinary least squares, which solves for the coefficients directly instead of iterating, so there is no learning rate or number of epochs to tune.
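
For a linear model, the coefficients that minimize the MSE can be found directly by solving a least-squares problem. The following is a minimal NumPy sketch of that idea on synthetic data; the data, the noise level, and the variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a known linear model plus small noise (illustrative only)
X = rng.normal(size=(200, 2))
true_beta0, true_beta = 1.5, np.array([2.0, -0.7])
y = true_beta0 + X @ true_beta + 0.01 * rng.normal(size=200)

# Prepend a column of ones so the intercept is estimated with the other coefficients
X_aug = np.column_stack([np.ones(len(X)), X])

# Least-squares solution: the coefficients that minimize the squared error
coef, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
beta0_hat, beta_hat = coef[0], coef[1:]

# MSE of the fitted model on the training data
mse_fit = float(np.mean((y - X_aug @ coef) ** 2))
```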







## 2. LSTM Model with Self-Attention
### 2.1 Overview
Long Short-Term Memory (LSTM) networks are a type of recurrent neural network (RNN) designed to handle sequential data. LSTM networks have the ability to learn and retain long-term dependencies in sequential data, making them well-suited for time-series forecasting, language modeling, and other sequence prediction tasks.

The addition of Self-Attention mechanisms enables the model to focus on important parts of the input sequence, improving its ability to learn relationships across different time steps.

### 2.2 Mathematical Formulation
#### 2.2.1 LSTM Cell
The LSTM cell is defined by the following equations that regulate the flow of information through its gates:



$$
f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)
$$
$$
i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)
$$
$$
\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)
$$
$$
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
$$
$$
o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)
$$
$$
h_t = o_t \odot \tanh(C_t)
$$

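Where:
- $ f_t $ is the forget gate.
- $ i_t $ is the input gate.
- $ \tilde{C}_t $ is the candidate memory cell.
- $ C_t $ is the memory cell state.
- $ o_t $ is the output gate.
- $ h_t $ is the hidden state, which is also the cell's output at time step $ t $.
- $ x_t $ is the input at time step $ t $, and $ [h_{t-1}, x_t] $ is the concatenation of the previous hidden state and the current input.
- $ W_f, W_i, W_C, W_o $ and $ b_f, b_i, b_C, b_o $ are the learned weight matrices and bias vectors.
- $ \sigma $ is the logistic sigmoid function.
- The products between gate vectors and states in the $ C_t $ and $ h_t $ updates are element-wise.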
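
The gate equations above can be traced step by step in plain NumPy. The sketch below implements a single LSTM time step and runs it over a short random sequence. The function name `lstm_step`, the dictionary layout of the weights, and the sizes are assumptions for illustration. Keras' `LSTM` layer stores its parameters differently.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM time step. W and b are dicts keyed by 'f', 'i', 'C', 'o'."""
    z = np.concatenate([h_prev, x_t])         # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])        # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])        # input gate
    C_tilde = np.tanh(W["C"] @ z + b["C"])    # candidate memory cell
    C_t = f_t * C_prev + i_t * C_tilde        # element-wise state update
    o_t = sigmoid(W["o"] @ z + b["o"])        # output gate
    h_t = o_t * np.tanh(C_t)                  # new hidden state
    return h_t, C_t

# Hypothetical sizes: 4 hidden units, 3 input features, 5 time steps
hidden, n_in = 4, 3
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.1, size=(hidden, hidden + n_in)) for k in "fiCo"}
b = {k: np.zeros(hidden) for k in "fiCo"}

h, C = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, n_in)):
    h, C = lstm_step(x_t, h, C, W, b)
```

Because $ h_t $ is a sigmoid output multiplied by a tanh output, every entry of the hidden state stays strictly between -1 and 1.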

#### 2.2.2 Self-Attention

The Self-Attention mechanism computes attention scores between every pair of positions in the input sequence, allowing the model to weight different parts of the sequence when computing the output at each time step. The attention mechanism is defined as:

$$
Q = XW_Q, \quad K = XW_K, \quad V = XW_V
$$
$$
\text{Attention Scores} = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)
$$
$$
\text{Output} = \text{Attention Scores} \cdot V
$$

Where:
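- $ X $ is the input sequence with one row per time step, and $ W_Q $, $ W_K $, $ W_V $ are learned projection matrices.
- $ Q $, $ K $, and $ V $ represent queries, keys, and values, respectively.
- $ d_k $ is the dimension of the key vectors.
- The softmax is applied row-wise, so the attention scores for each query position sum to 1 and can be interpreted as probabilities.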
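
To make the three attention equations concrete, here is a minimal NumPy sketch for a single sequence without a batch dimension. The function names, the sequence length, and the random projection matrices are assumptions for illustration only.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row maximum for numerical stability; the result is unchanged
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention for one sequence X of shape (T, d)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)   # shape (T, T)
    return weights @ V, weights

# Hypothetical sizes: 6 time steps, 4 features per step
rng = np.random.default_rng(0)
T, d = 6, 4
X = rng.normal(size=(T, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

out, weights = self_attention(X, W_Q, W_K, W_V)
```

Row `t` of `weights` shows how strongly time step `t` attends to every other time step, and each row sums to 1.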

### 2.3 Model Architecture
The model consists of an LSTM layer (50 units, returning the full sequence), followed by a Self-Attention layer, a Flatten layer, and a Dense output layer with a single linear unit. The LSTM layer captures temporal dependencies, while the Self-Attention layer allows the model to focus on relevant parts of the sequence.

In [1]:
import tensorflow as tf
from tensorflow.keras.layers import LSTM, Dense, Input, Flatten
from tensorflow.keras import Model
from tensorflow.keras.initializers import GlorotUniform

# Self Attention Layer
class SelfAttention(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super(SelfAttention, self).__init__(**kwargs)
    
    def build(self, input_shape):
        self.W_q = self.add_weight(shape=(input_shape[-1], input_shape[-1]), initializer=GlorotUniform(), trainable=True)
        self.W_k = self.add_weight(shape=(input_shape[-1], input_shape[-1]), initializer=GlorotUniform(), trainable=True)
        self.W_v = self.add_weight(shape=(input_shape[-1], input_shape[-1]), initializer=GlorotUniform(), trainable=True)
        super(SelfAttention, self).build(input_shape)
    
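    def call(self, inputs):
        # inputs has shape (batch, timesteps, features); project to queries, keys, values
        Q = tf.matmul(inputs, self.W_q)
        K = tf.matmul(inputs, self.W_k)
        V = tf.matmul(inputs, self.W_v)
        # Scaled dot-product scores, normalized with softmax over the last (key) axis
        attention_scores = tf.nn.softmax(tf.matmul(Q, K, transpose_b=True) / tf.sqrt(tf.cast(K.shape[-1], tf.float32)))
        # Weighted sum of the values for every time step
        return tf.matmul(attention_scores, V)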

# Model construction
def build_LSTM(input_shape):
    input_layer = Input(shape=input_shape)
    lstm_out = LSTM(50, return_sequences=True)(input_layer)
    attention_out = SelfAttention()(lstm_out)
    flatten_out = Flatten()(attention_out)

    output_layer = Dense(1, activation='linear')(flatten_out)

    model = Model(inputs=input_layer, outputs=output_layer)
    model.compile(optimizer='adam', loss='mse')
    
    return model
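
`build_LSTM` takes an `input_shape` of the form `(timesteps, features)`, so a one-dimensional series has to be cut into fixed-length windows first. The sketch below shows one possible way to do that with NumPy. The helper `make_windows`, the window length of 10, and the sine-wave series are assumptions for illustration and do not come from the original pipeline.

```python
import numpy as np

def make_windows(series, timesteps):
    """Hypothetical helper: window j is series[j : j + timesteps],
    and its target is the next value, series[j + timesteps]."""
    series = np.asarray(series, dtype=np.float32)
    n = len(series) - timesteps
    X = np.stack([series[j:j + timesteps] for j in range(n)])
    y = series[timesteps:]
    # Add a trailing feature axis so X has shape (samples, timesteps, 1)
    return X[..., np.newaxis], y

# Toy univariate series (illustrative only)
series = np.sin(np.linspace(0, 10, 100))
X, y = make_windows(series, timesteps=10)

# X.shape[1:] is (10, 1), which is the shape to pass as input_shape to build_LSTM
```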


### 2.4 Training and Optimization
The LSTM model is trained to minimize the Mean Squared Error (MSE), the same loss function used for the regression model. The optimizer is Adam, a gradient-based optimizer that adapts the step size for each parameter and is widely used for training deep learning models.
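
To show what an adaptive learning rate means in practice, the sketch below applies the standard Adam update rule to an MSE loss on a tiny linear model. It is a simplified stand-in for what Keras does internally when `optimizer='adam'` is used. The function name, the hyperparameter values, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def adam_minimize_mse(X, y, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize mean((X @ w - y)^2) over w with the Adam update rule."""
    w = np.zeros(X.shape[1])
    m = np.zeros_like(w)   # running mean of the gradients
    v = np.zeros_like(w)   # running mean of the squared gradients
    losses = []
    for t in range(1, steps + 1):
        residual = X @ w - y
        losses.append(float(np.mean(residual ** 2)))
        grad = 2.0 * X.T @ residual / len(y)
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        m_hat = m / (1 - beta1 ** t)   # bias correction for the early steps
        v_hat = v / (1 - beta2 ** t)
        # Each parameter gets its own step size, scaled by sqrt(v_hat)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, losses

# Noise-free synthetic data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
w_true = np.array([1.0, -2.0])
y = X @ w_true

w, losses = adam_minimize_mse(X, y)
```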

## 3. Conclusion
In this report, we have discussed two models: the Linear Regression model and the LSTM model with Self-Attention.

- Linear Regression provides a simple and interpretable model for predicting continuous variables.
- LSTM with Self-Attention offers an advanced method for time-series forecasting by capturing long-term dependencies and allowing the model to focus on relevant information through attention mechanisms.

Both models are useful for different tasks depending on the complexity of the data and the problem at hand. The Linear Regression model is a good starting point for problems with linear relationships, while the LSTM model is better suited to sequential or time-series data that requires capturing complex temporal dependencies.