# Detailed Tutorial: Transformer from Scratch for Tabular Data

---

## 1. Understanding Transformers

Before we dive into the code, it's important to understand what a **Transformer** is and why it can be a useful model for many tasks, including tabular data.

The **Transformer** architecture was introduced in the paper ["Attention is All You Need"](https://arxiv.org/abs/1706.03762). It revolutionized NLP and has since been applied to many other types of data.

### Key Concepts:

- **Self-Attention**: A mechanism that lets each input token (here, each feature) weigh how relevant every other token is when computing its own representation.
- **Multi-Head Attention**: Allows the model to focus on different parts of the input simultaneously.
- **Feed Forward Network (FFN)**: After attention, the transformed inputs pass through fully connected layers to learn further representations.
- **Layer Normalization**: Normalizes each token's activations across the embedding dimension, which improves training stability.
- **Residual Connections**: Add a layer's input back to its output, which keeps the original signal available and gives gradients a direct path through the network.

### Why Use Transformers for Tabular Data?
- Tabular data often contains both categorical and numerical features.
- A transformer’s ability to learn **relationships** between features can be helpful for structured data.
- It allows each feature to "attend" to other features.

---

## 2. Implementing Multi-Head Self-Attention

### What is Self-Attention?

Self-attention allows each feature (or token in NLP) to draw on the other features while it is being processed, which gives it context. For example, in a tabular dataset, when the model processes the "age" feature, self-attention lets its representation incorporate information from other features (like "blood pressure" or "cholesterol") in proportion to how relevant the model finds them.
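
To make the idea concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. The three tokens, the 4-dimensional embeddings, and the random weight matrices are illustrative assumptions, not values from the tutorial's model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over feature tokens."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (n_tokens, n_tokens) similarities
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
# Three hypothetical feature tokens (say age, blood pressure, cholesterol)
tokens = rng.normal(size=(3, 4))
w_q, w_k, w_v = (rng.normal(size=(4, 4)) for _ in range(3))

context, weights = self_attention(tokens, w_q, w_k, w_v)
print(weights.shape)   # (3, 3)
print(context.shape)   # (3, 4)
```

Row `i` of `weights` shows how much token `i` draws from every token, including itself.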

### Multi-Head Self-Attention:

Instead of one attention mechanism, **multi-head attention** performs attention in parallel across multiple heads, learning different aspects of relationships in data.

```python
import tensorflow as tf
from tensorflow.keras import layers

class MultiHeadSelfAttention(layers.Layer):
    def __init__(self, embed_dim, num_heads=8):
        super(MultiHeadSelfAttention, self).__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        
        # Ensure that embedding dimension is divisible by number of heads
        if embed_dim % num_heads != 0:
            raise ValueError(f"Embedding dimension {embed_dim} should be divisible by number of heads {num_heads}.")
        
        self.projection_dim = embed_dim // num_heads
        self.query_dense = layers.Dense(embed_dim)
        self.key_dense = layers.Dense(embed_dim)
        self.value_dense = layers.Dense(embed_dim)
        self.combine_heads = layers.Dense(embed_dim)
    
    def attention(self, query, key, value):
        score = tf.matmul(query, key, transpose_b=True)
        dim_key = tf.cast(tf.shape(key)[-1], tf.float32)
        scaled_score = score / tf.math.sqrt(dim_key)
        weights = tf.nn.softmax(scaled_score, axis=-1)
        output = tf.matmul(weights, value)
        return output, weights

    def separate_heads(self, x, batch_size):
        # Split into multiple heads
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.projection_dim))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, inputs):
        batch_size = tf.shape(inputs)[0]
        
        # Linear projections for query, key, and value
        query = self.query_dense(inputs)
        key = self.key_dense(inputs)
        value = self.value_dense(inputs)
        
        # Split and perform attention
        query = self.separate_heads(query, batch_size)
        key = self.separate_heads(key, batch_size)
        value = self.separate_heads(value, batch_size)
        attention_output, _ = self.attention(query, key, value)
        
        # Concatenate and project output
        attention_output = tf.transpose(attention_output, perm=[0, 2, 1, 3])
        concat_attention = tf.reshape(attention_output, (batch_size, -1, self.embed_dim))
        output = self.combine_heads(concat_attention)
        return output
```
### Explanation:
* Query, Key, Value: The core components of attention. Query is what we're currently focusing on, Key is a reference to other parts of the data, and Value is the information associated with those keys.
* Matmul and Softmax: We compute a similarity score between query and key using matrix multiplication, scale it by the square root of the key dimension, then normalize it with softmax.
* Multi-Head: Multiple attention mechanisms are applied in parallel, allowing the model to focus on different aspects of the data.
* Combine Heads: After processing with multiple heads, we combine the results into one.
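
The reshape and transpose steps in `separate_heads` and in the final concatenation are easy to get wrong, so a shape check can help. The sketch below mirrors them in NumPy under assumed sizes (batch of 2, 5 tokens, `embed_dim=32`, 4 heads); the helper names are hypothetical stand-ins for the layer's methods.

```python
import numpy as np

def separate_heads(x, num_heads):
    # (batch, seq_len, embed_dim) -> (batch, num_heads, seq_len, projection_dim)
    batch, seq_len, embed_dim = x.shape
    x = x.reshape(batch, seq_len, num_heads, embed_dim // num_heads)
    return x.transpose(0, 2, 1, 3)

def combine_heads(x):
    # (batch, num_heads, seq_len, projection_dim) -> (batch, seq_len, embed_dim)
    batch, num_heads, seq_len, proj_dim = x.shape
    return x.transpose(0, 2, 1, 3).reshape(batch, seq_len, num_heads * proj_dim)

x = np.arange(2 * 5 * 32, dtype=np.float32).reshape(2, 5, 32)
heads = separate_heads(x, num_heads=4)
restored = combine_heads(heads)

print(heads.shape)      # (2, 4, 5, 8)
print(restored.shape)   # (2, 5, 32)
```

Each head sees a contiguous 8-wide slice of every token's embedding, and combining the heads restores the original layout exactly.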


## 3. Building the Transformer Block

The **Transformer Block** combines multi-head attention with a feed-forward network. After each of these two sub-layers, we apply **dropout** for regularization, add the residual connection, and normalize the result.

```python
class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerBlock, self).__init__()
        self.att = MultiHeadSelfAttention(embed_dim, num_heads)
        self.ffn = tf.keras.Sequential([
            layers.Dense(ff_dim, activation='relu'),
            layers.Dense(embed_dim)
        ])
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, training=None):
        attn_output = self.att(inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)
```
### Explanation:
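* Layer Normalization: Normalizes each token's activations across the embedding dimension after the residual addition, which helps training stay stable and converge faster.
* Feed-Forward Network (FFN): Fully connected layers (Dense), applied to each token independently, that learn more abstract features from the output of the attention mechanism.
* Residual Connection: Adding a sub-layer's input back to its output keeps the original signal available and gives gradients a direct path through the block.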
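
The "add, then normalize" pattern used twice in the block can be sketched in NumPy. This is a simplified illustration: it omits the learnable scale and offset that Keras' `LayerNormalization` also applies, and the random matrix `w` is an assumed stand-in for the attention or feed-forward sub-layer.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token's vector over its last (embedding) axis
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_sublayer(x, sublayer):
    # Residual addition first, then normalization
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(2, 5, 32))   # (batch, tokens, embed_dim)
w = rng.normal(size=(32, 32)) * 0.1                    # stand-in sub-layer weights

out = post_ln_sublayer(x, lambda t: t @ w)
print(out.shape)  # (2, 5, 32)
```

Whatever the scale of the input, every token vector in `out` ends up with roughly zero mean and unit variance.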

## 4. Data Preparation

We will now prepare a **tabular dataset**. In this example, we use the heart disease dataset from TensorFlow, focusing on numerical columns.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Example dataset (replace with your actual dataset)
url = "https://storage.googleapis.com/download.tensorflow.org/data/heart.csv"
df = pd.read_csv(url)

# Define the numerical feature columns and the target column
numerical_cols = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
target_col = 'target'

# Process the features and labels
X = df[numerical_cols].values
y = df[target_col]

# Encode target labels
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize numerical features using statistics from the training set only,
# so no information from the validation set leaks into preprocessing
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
```
### Explanation:
* Scaling: We normalize the data to have mean 0 and standard deviation 1 to ensure that features are on the same scale.
* Encoding: `LabelEncoder` maps the target classes to integers from 0 to `n_classes - 1`, the format that `sparse_categorical_crossentropy` expects.
* Splitting: We divide the dataset into training and validation sets, with 80% training and 20% validation.
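
Standardization itself is just `(x - mean) / std` per column. The NumPy sketch below uses made-up values (not rows from `heart.csv`) and shows one common practice: computing the statistics on the training rows and reusing them for the validation rows.

```python
import numpy as np

# Made-up values for two features (for example age and chol)
X_train_demo = np.array([[50.0, 200.0], [60.0, 240.0], [70.0, 280.0]])
X_val_demo = np.array([[60.0, 260.0]])

mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)   # population std (ddof=0), as StandardScaler uses

X_train_scaled = (X_train_demo - mean) / std
X_val_scaled = (X_val_demo - mean) / std   # reuse the training statistics

print(X_train_scaled.mean(axis=0))
print(X_val_scaled)
```

The scaled training columns have mean 0 and standard deviation 1, while the validation row is expressed in the same units without influencing them.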

## 5. Building the Transformer Model

The model integrates the **Transformer Block** with a feed-forward classifier for tabular data.

```python
def build_transformer_model(input_shape, num_classes):
    inputs = layers.Input(shape=input_shape)
    
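    # Project each scalar feature to a 32-dimensional token so the residual
    # additions inside the block combine tensors of the same shape
    x = layers.Dense(32)(inputs)

    # Add Transformer block; training=None lets Keras decide, so dropout is
    # active during fit() and disabled during evaluate()/predict()
    transformer_block = TransformerBlock(embed_dim=32, num_heads=4, ff_dim=64)
    x = transformer_block(x, training=None)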
    
    # Global pooling and dense layers
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dense(32, activation="relu")(x)
    x = layers.Dropout(0.1)(x)
    x = layers.Dense(16, activation="relu")(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    
    model = tf.keras.Model(inputs=inputs, outputs=outputs)
    return model

# Get the input shape (number of features) and number of classes
input_shape = (X_train.shape[1], 1)  # Each feature is one token holding a single scalar value
num_classes = len(set(y_train))

# Build and compile the model
model = build_transformer_model(input_shape=input_shape, num_classes=num_classes)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
```
### Explanation:
* Global Pooling: Averages the block's output across all feature tokens, producing one fixed-size vector per sample that the dense classifier can consume.
* Dense Layers: Adds fully connected layers to output the final classification result.
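
What `GlobalAveragePooling1D` computes can be checked with a small NumPy sketch. The shapes (4 samples, 5 feature tokens, 32-dimensional embeddings) are illustrative assumptions, and random values stand in for the Transformer block's output.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the Transformer block output: (batch, n_tokens, embed_dim)
block_output = rng.normal(size=(4, 5, 32))

# GlobalAveragePooling1D averages over the token axis (axis 1)
pooled = block_output.mean(axis=1)

# Mean pooling ignores token order: reversing the feature tokens changes nothing
shuffled = block_output[:, ::-1, :]
same = np.allclose(pooled, shuffled.mean(axis=1))

print(pooled.shape)  # (4, 32)
print(same)          # True
```

Because the average discards token order, any information about which feature is which has to be captured before pooling.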

## 6. Training and Plotting the Loss

Finally, we train the model and visualize how the loss evolves over time.

```python
# Reshape X_train and X_val to match input shape for the transformer
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
X_val = X_val.reshape((X_val.shape[0], X_val.shape[1], 1))

# Train the model
history = model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=50, batch_size=32)

# Plot the training and validation loss
import matplotlib.pyplot as plt

plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Training and Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
```
### Explanation:
* Reshape: We reshape the data to `(samples, n_features, 1)` so that each feature becomes one token, matching the input shape the model was built with.
* Training: We train the model and track the loss for both training and validation data.
* Plotting: Loss curves help monitor the model’s performance and spot any overfitting.
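
Beyond eyeballing the plot, the same check can be done numerically. The sketch below uses hypothetical loss values shaped like `history.history` (they are not results from this model), and the helper names are illustrative.

```python
# Hypothetical loss values shaped like `history.history`; not real training results
history_dict = {
    "loss":     [0.70, 0.55, 0.45, 0.38, 0.33, 0.29, 0.26],
    "val_loss": [0.68, 0.56, 0.50, 0.47, 0.48, 0.51, 0.55],
}

def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__) + 1

def overfit_gap(hist, epoch):
    """Validation loss minus training loss at a given 1-based epoch."""
    return hist["val_loss"][epoch - 1] - hist["loss"][epoch - 1]

epoch = best_epoch(history_dict["val_loss"])
print(epoch)                                    # 4
print(round(overfit_gap(history_dict, 7), 2))   # 0.29
```

A validation loss that bottoms out and then rises while the training loss keeps falling is the usual sign of overfitting. One option in that case is to stop training near the best epoch, for instance with Keras' `EarlyStopping` callback.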