**PART 3**

**ADVANCED DEEP NETWORKS FOR COMPLEX PROBLEMS**

---

**CHAPTER 13 - Transformers**

---

### **13.1 Transformers**

While recurrent sequence models such as LSTMs and GRUs have been pivotal in NLP, they suffer from a significant bottleneck: sequential processing. They must process tokens one at a time, which prevents parallelization across time steps and makes it difficult to retain information over long sequences. The **Transformer** architecture addresses this by discarding recurrence entirely and relying on the **attention mechanism** to process the entire sequence simultaneously. This lets the model capture dependencies between all words in a sentence, regardless of their distance.

![Figure 13.1 How Transformer models are used to solve NLP problems](Figure 13.1 How Transformer models are used to solve NLP problems)

#### **Revisiting the basic components of the Transformer**

The Transformer adheres to the standard **Encoder-Decoder** paradigm, but with a specific internal structure for each:

* **Encoder**: Its role is to ingest the input sequence and generate a rich, contextualized latent representation. It consists of a stack of $N$ identical layers. Each layer has two main sub-layers:
    1.  **Multi-Head Self-Attention**: This mechanism allows the model to associate each word with every other word in the sequence to understand context.
    2.  **Fully Connected Feed-Forward Network**: A standard dense network applied to each position independently.

* **Decoder**: Its role is to generate the target sequence token by token. It also consists of a stack of $N$ layers, but each layer has three sub-layers:
    1.  **Masked Self-Attention**: Ensures that the prediction for position $i$ can depend only on the known outputs at positions less than $i$ (preventing the model from "seeing the future").
    2.  **Encoder-Decoder Attention**: This layer performs attention over the encoder's output, allowing the decoder to focus on relevant parts of the input sentence.
    3.  **Fully Connected Feed-Forward Network**: Same as in the encoder.

**The Self-Attention Mechanism**:
Self-attention is the engine of the Transformer. For every token in the sequence, the model computes three vectors via trainable weight matrices: a **Query ($q$)**, a **Key ($k$)**, and a **Value ($v$)**. Stacking these vectors for all tokens gives the matrices $Q$, $K$, and $V$. The attention score between two words is the dot product of one word's Query vector with the other word's Key vector. The final output for each token is a weighted sum of the Value vectors, with weights given by the softmax of the scaled scores:

$$h = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$$

Here $d_k$ is the dimensionality of the key vectors. Dividing by $\sqrt{d_k}$ keeps the dot products from growing too large, which would otherwise push the softmax function into regions with extremely small gradients.
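
The formula above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the chapter's implementation: the toy sizes, the random weight matrices `W_q`, `W_k`, `W_v`, and the optional `causal` flag (which mimics the decoder's masked self-attention by hiding later positions) are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (output, weights)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (seq_len, seq_len)
    if causal:
        # Hide positions j > i so token i cannot attend to later tokens
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -1e9, scores)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 3
x = rng.normal(size=(seq_len, d_model))        # toy token embeddings
# Hypothetical trainable projections, randomly initialized here
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

h, weights = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
h_masked, weights_masked = scaled_dot_product_attention(
    x @ W_q, x @ W_k, x @ W_v, causal=True)
print(h.shape, weights.shape)
```

With `causal=True`, the attention weights above the diagonal are zero, so the first token can only attend to itself.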

#### **Embeddings in the Transformer**

Because the Transformer processes tokens in parallel rather than sequentially, it has no inherent understanding of the order of words (unlike an RNN which processes $t_1$ before $t_2$). To restore this information, the Transformer adds specific positional information to the input embeddings.

1.  **Token Embeddings**: These are standard learned vector representations for each token in the vocabulary, capturing semantic meaning (similar to Word2Vec).
2.  **Positional Embeddings**: These are vectors of the same dimension as token embeddings but contain information about the *position* of the token in the sequence. In the original paper, these are not learned but are generated using fixed sine and cosine functions of different frequencies:
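    $$PE(pos, 2i) = \sin(pos/10000^{2i/d_{model}})$$
    $$PE(pos, 2i+1) = \cos(pos/10000^{2i/d_{model}})$$
    Here, $pos$ is the position of the token in the sequence, $i$ indexes pairs of feature dimensions (so $2i$ and $2i+1$ are the even and odd features), and $d_{model}$ is the embedding size. The original authors chose this form on the hypothesis that it lets the model easily learn to attend by relative positions.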

The final input to the encoder is the element-wise sum of the Token Embeddings and the Positional Embeddings.
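
As a rough illustration of the two formulas and the final sum, the sketch below builds the sinusoidal matrix in NumPy and adds it to a set of stand-in token embeddings. The function name, the toy sizes, and the random token embeddings are assumptions for the example, and `d_model` is assumed to be even.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix; d_model is assumed to be even."""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even features: PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                   # odd features:  PE(pos, 2i+1)
    return pe

seq_len, d_model = 10, 16
# Stand-in for learned token embeddings (random values, for illustration only)
token_emb = np.random.default_rng(0).normal(size=(seq_len, d_model))
pos_emb = sinusoidal_positional_encoding(seq_len, d_model)

# Element-wise sum gives the encoder input
encoder_input = token_emb + pos_emb
print(encoder_input.shape)
```

At position 0 every sine feature is 0 and every cosine feature is 1, which is a quick sanity check on the implementation.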

![Figure 13.2 How positional embeddings change with the time step and the feature position](Figure 13.2 How positional embeddings change with the time step and the feature position)

![Figure 13.3 The embeddings generated in a Transformer model and how the final embeddings are computed](Figure 13.3 The embeddings generated in a Transformer model and how the final embeddings are computed)

#### **Residuals and normalization**

Training very deep neural networks is difficult due to the vanishing gradient problem. The Transformer employs two key architectural features in every sub-layer to mitigate this and ensure training stability:

* **Residual Connections (Skip Connections)**: The input of a sub-layer is added to its output: $Output = x + Sublayer(x)$. This creates a direct path for gradients to flow backward through the network, preventing them from diminishing.
* **Layer Normalization**: Unlike Batch Normalization, which normalizes across the batch dimension, Layer Normalization normalizes the inputs across the *feature* dimension for each sample independently. This is crucial for NLP tasks where sequence lengths vary and batch statistics can be unstable. The normalization is applied after the residual addition.
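
The two ideas combine into a single "add and normalize" step. The following NumPy sketch assumes a post-norm arrangement, as described above, and uses a hypothetical one-layer ReLU network as a stand-in for the real sub-layer; `gamma` and `beta` are the usual learnable scale and shift, fixed here to 1 and 0.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    # Normalize each row (one token) across its feature dimension
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer, gamma, beta):
    # Residual connection first, then layer normalization
    return layer_norm(x + sublayer(x), gamma, beta)

rng = np.random.default_rng(0)
d_model = 8
x = rng.normal(size=(5, d_model))                  # 5 tokens, 8 features each
W = rng.normal(size=(d_model, d_model))

def feed_forward(t):
    # Hypothetical stand-in for a real sub-layer (attention or feed-forward)
    return np.maximum(t @ W, 0.0)

out = add_and_norm(x, feed_forward,
                   gamma=np.ones(d_model), beta=np.zeros(d_model))
print(out.mean(axis=-1).round(6), out.std(axis=-1).round(3))
```

Each token's features end up with approximately zero mean and unit variance, independent of the other tokens and of the batch.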

![Figure 13.5 How residual connections and layer normalization layers are used in the Transformer model](Figure 13.5 How residual connections and layer normalization layers are used in the Transformer model)

### **13.2 Using pretrained BERT for spam classification**

#### **Understanding BERT**

**BERT (Bidirectional Encoder Representations from Transformers)** marked a landmark shift in NLP, often described as the field's "ImageNet moment." It is essentially the **Encoder** stack of the Transformer architecture, pretrained on massive amounts of unlabeled text. This pretraining allows BERT to learn deep, bidirectional representations of language that can be fine-tuned for specific downstream tasks with relatively small labeled datasets.

![Figure 13.6 The high-level architecture of BERT](Figure 13.6 The high-level architecture of BERT)

**Input Representation**:
BERT requires specific formatting for its inputs:
* **`[CLS]` Token**: A special classification token inserted at the beginning of every sequence. The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks (e.g., spam detection).
* **`[SEP]` Token**: A separator token used to distinguish between two sentences (e.g., for Question Answering or Next Sentence Prediction).
* **Segment Embeddings**: In addition to token and positional embeddings, BERT uses segment embeddings to explicitly signal whether a token belongs to the first sentence (Sentence A) or the second (Sentence B).

**Pretraining Tasks**:
BERT is trained on two self-supervised tasks simultaneously:
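1.  **Masked Language Modeling (MLM)**: 15% of the input tokens are randomly selected for prediction. Most of them are replaced with a `[MASK]` token (in the original BERT recipe, 80% become `[MASK]`, 10% are swapped for a random token, and 10% are left unchanged), and the model must predict the original token based on the context. This forces the model to learn bidirectional context.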
2.  **Next-Sentence Prediction (NSP)**: The model is fed pairs of sentences and must predict whether the second sentence actually follows the first in the original text. This helps the model understand relationships between sentences.
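
To make the MLM objective concrete, here is a simplified pure-Python sketch of how a masked training example could be built. It always substitutes `[MASK]` for the selected tokens and never touches `[CLS]` or `[SEP]`; the function name, the fixed seed, and the example sentence are assumptions for illustration and do not reproduce BERT's actual data pipeline.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace roughly mask_rate of the ordinary tokens with [MASK].

    Returns the masked token list and a dict {position: original token}
    holding the prediction targets.
    """
    rng = random.Random(seed)
    special = {"[CLS]", "[SEP]"}
    candidates = [i for i, t in enumerate(tokens) if t not in special]
    if not candidates:
        return list(tokens), {}
    n_mask = max(1, round(mask_rate * len(candidates)))
    chosen = sorted(rng.sample(candidates, n_mask))

    masked = list(tokens)
    targets = {}
    for i in chosen:
        targets[i] = tokens[i]     # what the model must predict
        masked[i] = "[MASK]"       # what the model actually sees
    return masked, targets

tokens = ["[CLS]", "free", "entry", "to", "win", "a", "prize",
          "text", "now", "[SEP]"]
masked, targets = mask_tokens(tokens)
print(masked)
print(targets)
```

The model is then trained to recover each entry of `targets` from the surrounding, unmasked tokens on both sides.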

![Figure 13.7 The methodology used for pretraining BERT](Figure 13.7 The methodology used for pretraining BERT)

#### **Classifying spam with BERT in TensorFlow**

We will implement a spam classifier by fine-tuning a pretrained BERT model from TensorFlow Hub on the SMS Spam Collection dataset.

**1. Loading Data and Handling Imbalance**
The dataset contains SMS messages labeled as 'ham' or 'spam'. Since 'spam' is the minority class, we must handle the class imbalance to prevent the model from becoming biased. We use the `imbalanced-learn` library to perform undersampling.
* **Test/Validation Sets**: Created via Random Undersampling to ensure balanced evaluation.
* **Training Set**: Created using the **NearMiss** algorithm. NearMiss selects samples from the majority class (ham) that are mathematically closest to the minority class (spam), forcing the model to learn the difficult decision boundary.

In [None]:
import os
import numpy as np
from imblearn.under_sampling import RandomUnderSampler, NearMiss
from sklearn.feature_extraction.text import CountVectorizer

# 1. Load Data
inputs = []
labels = []
with open(os.path.join('data', 'SMSSpamCollection'), 'r', encoding='utf-8') as f:
    for r in f:
        if r.startswith('ham'):
            inputs.append(r[4:])
            labels.append(0)
        elif r.startswith('spam'):
            inputs.append(r[5:])
            labels.append(1)

inputs = np.array(inputs).reshape(-1, 1)
labels = np.array(labels)

# 2. Undersampling for Test/Validation (Random)
rus = RandomUnderSampler(sampling_strategy={0: 100, 1: 100}, random_state=4321)
inputs_res, labels_res = rus.fit_resample(inputs, labels)
test_inds = rus.sample_indices_
test_x, test_y = inputs[test_inds], labels[test_inds]

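# (Logic to separate validation set would go here...)
# NOTE: train_x / train_y below are assumed to hold the samples that remain
# after the test and validation samples have been removed (that step is not shown).

# 3. Undersampling for Training (NearMiss)
# We convert text to Bag-of-Words first because NearMiss requires numerical input
countvec = CountVectorizer()
train_bow = countvec.fit_transform(train_x.reshape(-1).tolist())
nm = NearMiss()
# NearMiss resamples the Bag-of-Words matrix; use the selected row indices to
# recover the original text, which is what BERT needs as input
nm.fit_resample(train_bow, train_y)
train_inds = nm.sample_indices_
train_x_res, train_y_res = train_x[train_inds], train_y[train_inds]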

**2. Defining the BERT Model**
We use `tensorflow_text` and `tensorflow_models` to integrate BERT. The process involves:
* **Tokenizer**: We use `FastWordpieceBertTokenizer` to break text into sub-words (e.g., "walking" -> "walk" + "##ing"). This handles out-of-vocabulary words effectively.
* **Input Formatting**: `BertPackInputs` automatically adds the `[CLS]` and `[SEP]` tokens and generates the required masks.
* **Encoder**: We download the pretrained BERT encoder (`bert_en_uncased_L-12_H-768_A-12`) from TensorFlow Hub.
* **Classifier**: We use the `BertClassifier` class, which conveniently wraps the encoder and adds a classification head (dropout + dense layer) on top of the `[CLS]` token's output.
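
The tokenization and input-formatting steps can be illustrated without any TensorFlow dependency. The sketch below applies the greedy longest-match-first rule that WordPiece uses to a toy vocabulary and then wraps the result in `[CLS]` and `[SEP]`. The vocabulary, the function names, and the whitespace pre-splitting are assumptions for the example; the real `FastWordpieceBertTokenizer` and `BertPackInputs` layers work on integer IDs and also handle padding, truncation, and masks.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]                 # no piece matched
        pieces.append(piece)
        start = end
    return pieces

def pack_single_sentence(text, vocab):
    """Lowercase, split on whitespace, tokenize, and add [CLS] / [SEP]."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    tokens.append("[SEP]")
    return tokens

# Toy vocabulary, for illustration only
toy_vocab = {"walk", "##ing", "##ed", "talk", "i", "am"}
print(pack_single_sentence("I am walking", toy_vocab))
# ['[CLS]', 'i', 'am', 'walk', '##ing', '[SEP]']
```

A word with no matching pieces, such as `"xyz"` here, falls back to `[UNK]`.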

**3. Compilation and Training**
We use the **Adam** optimizer with a specific learning rate schedule: a linear warmup followed by polynomial decay. This is standard for training Transformer models to prevent instability in early training.
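
The shape of such a schedule is easy to see in plain Python. The function below is a simplified sketch, not the library's exact composition of warmup and decay: the name `warmup_then_decay_lr`, the step counts, and the choice of `power=1.0` (which makes the polynomial decay linear) are assumptions for illustration.

```python
def warmup_then_decay_lr(step, peak_lr, warmup_steps, total_steps,
                         end_lr=0.0, power=1.0):
    """Linear warmup to peak_lr, then polynomial decay down to end_lr.

    Assumes 0 < warmup_steps < total_steps.
    """
    if step < warmup_steps:
        # Warmup: grow linearly from 0 to peak_lr
        return peak_lr * step / warmup_steps
    # Decay: fraction of the post-warmup phase completed, capped at 1
    decay_steps = total_steps - warmup_steps
    progress = min(step - warmup_steps, decay_steps) / decay_steps
    return (peak_lr - end_lr) * (1.0 - progress) ** power + end_lr

# Hypothetical training run: 100 steps, the first 10 of them warmup
for step in (0, 5, 10, 55, 100):
    lr = warmup_then_decay_lr(step, peak_lr=3e-5,
                              warmup_steps=10, total_steps=100)
    print(step, lr)
```

The small learning rates at the start keep the freshly initialized classification head from disturbing the pretrained encoder weights too aggressively.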

In [None]:
import tensorflow_hub as hub
import tensorflow_models as tfm
import tensorflow as tf

# Define Inputs
max_seq_length = 60
input_word_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="input_word_ids")
input_mask = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="input_mask")
input_type_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32, name="input_type_ids")

# Load BERT Encoder
hub_bert_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"
bert_layer = hub.KerasLayer(hub_bert_url, trainable=True)
encoder_outputs = bert_layer({
    "input_word_ids": input_word_ids,
    "input_mask": input_mask,
    "input_type_ids": input_type_ids
})

# Create Encoder Model
hub_encoder = tf.keras.models.Model(
    inputs={'input_word_ids': input_word_ids, 'input_mask': input_mask, 'input_type_ids': input_type_ids},
    outputs={'sequence_output': encoder_outputs["sequence_output"], 'pooled_output': encoder_outputs["pooled_output"]}
)

# Create Classifier
bert_classifier = tfm.nlp.models.BertClassifier(network=hub_encoder, num_classes=2)

# Compilation (a constant learning rate is shown here for brevity; a
# warmup/decay schedule object can be passed as learning_rate instead)
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
bert_classifier.compile(
    optimizer=optimizer,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')]
)

# Training
# bert_classifier.fit(train_inputs, train_y, ...)

### **13.3 Question answering with Hugging Face's Transformers**

Hugging Face's `transformers` library provides a high-level API to easily access and fine-tune state-of-the-art models. We will use it to build a Question Answering (QA) system using **DistilBERT** (a smaller, faster, cheaper version of BERT trained via knowledge distillation).

#### **Understanding the data**
We use the **SQuAD v1** (Stanford Question Answering Dataset). The objective is: given a **context** paragraph and a **question**, predict the **answer** text span within the context.
The dataset provides:
* Context text
* Question text
* Answer text
* `answer_start`: The starting *character* index of the answer in the context.

#### **Processing data**
Transformers process tokens, not characters. Therefore, a crucial preprocessing step is mapping the provided character indices to token indices.
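1.  **Cleaning**: We perform a sanity check to fix known offset issues in the SQuAD dataset, where the provided index might be off by a few characters, and compute an `answer_end` character index for each answer (the dataset only provides `answer_start`).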
2.  **Tokenization**: We use `DistilBertTokenizerFast`. This "Fast" tokenizer is backed by Rust and provides a `char_to_token()` method, which is essential for mapping our answer character spans to the corresponding token spans.
3.  **Data Pipeline**: We construct a `tf.data.Dataset` generator to efficiently feed batches of (Input IDs, Attention Mask) and (Start Token Index, End Token Index) to the model.
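
The character-to-token mapping is the least obvious of these steps, so here is a toy version of the idea in pure Python. It assumes a whitespace tokenizer that records each token's character offsets; the real fast tokenizer does the same kind of lookup over its sub-word offsets. The helper names and the example sentence are hypothetical.

```python
def whitespace_offsets(text):
    """Toy tokenizer: (token, start_char, end_char) per whitespace-separated word."""
    offsets, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        offsets.append((word, start, end))
        pos = end
    return offsets

def char_to_token(offsets, char_index):
    """Return the index of the token covering char_index, or None."""
    for tok_idx, (_, start, end) in enumerate(offsets):
        if start <= char_index < end:
            return tok_idx
    return None   # e.g. the character is a space between tokens

context = "The Transformer was introduced in 2017 by researchers at Google"
answer_text = "researchers at Google"
answer_start = context.index(answer_text)        # character index, as in SQuAD
answer_end = answer_start + len(answer_text)     # exclusive end, computed by us

offsets = whitespace_offsets(context)
start_token = char_to_token(offsets, answer_start)
end_token = char_to_token(offsets, answer_end - 1)   # last character of the answer
print(start_token, end_token)
```

The `answer_end - 1` lookup mirrors the notebook code: the end index is exclusive, so the last answer character sits one position earlier.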

In [None]:
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

def update_char_to_token_positions_inplace(encodings, answers):
    start_positions = []
    end_positions = []
    for i in range(len(answers)):
        # Convert character index to token index
        # We use the Fast Tokenizer's char_to_token method
        start_pos = encodings.char_to_token(i, answers[i]['answer_start'])
        end_pos = encodings.char_to_token(i, answers[i]['answer_end'] - 1)
        
        # Handle cases where the answer might have been truncated due to max length
        if start_pos is None:
            start_pos = tokenizer.model_max_length
        if end_pos is None:
            end_pos = tokenizer.model_max_length
            
        start_positions.append(start_pos)
        end_positions.append(end_pos)
            
    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})

#### **Defining the DistilBERT model**
We use the `TFDistilBertForQuestionAnswering` class. This is a pre-built architecture that:
1.  Uses the DistilBERT encoder to get hidden states for every token.
2.  Adds two classification heads on top: one to predict the **Start Logits** (an unnormalized score for each token being the start of the answer) and one for the **End Logits**.

**Important Implementation Note**: Hugging Face models return a custom `ModelOutput` object, which is not directly compatible with the standard Keras `model.fit()` loop (which expects tuples or tensors). To fix this, we wrap the Hugging Face model inside a custom `tf.keras.Model` that manually extracts the start and end logits.

#### **Training the model**
We compile the wrapped model with `SparseCategoricalCrossentropy` (from_logits=True) since the targets are integer indices of the start and end tokens. We train using `model.fit()`.
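
To see what this loss computes, the NumPy sketch below evaluates sparse categorical cross-entropy from logits for one example with a six-token context. The logits and target indices are made-up values, and summing the start and end losses reflects Keras's default behavior of adding the per-output losses.

```python
import numpy as np

def sparse_ce_from_logits(logits, target_index):
    """Per-example cross-entropy; logits: (batch, seq_len), targets: (batch,)."""
    # Log-softmax, computed in a numerically stable way
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability of the correct token position
    return -log_probs[np.arange(len(target_index)), target_index]

# Made-up logits for a batch of one example with six tokens
start_logits = np.array([[0.1, 2.0, 0.3, 4.0, 0.2, 0.1]])
end_logits = np.array([[0.0, 0.1, 0.2, 0.5, 3.5, 0.3]])
start_target = np.array([3])    # answer starts at token 3
end_target = np.array([4])      # answer ends at token 4

start_loss = sparse_ce_from_logits(start_logits, start_target)
end_loss = sparse_ce_from_logits(end_logits, end_target)
total_loss = start_loss + end_loss
print(start_loss, end_loss, total_loss)
```

The loss is small when the highest logit sits at the target index and grows as probability mass moves to other positions.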

In [None]:
from transformers import TFDistilBertForQuestionAnswering
import tensorflow as tf

# 1. Download Pretrained Model
hf_model = TFDistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased")

# 2. Wrap model to be Keras-compatible
def tf_wrap_model(model):
    input_ids = tf.keras.layers.Input([None,], dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.layers.Input([None,], dtype=tf.int32, name="attention_mask")
    
    # Forward pass through HF model
    out = model([input_ids, attention_mask])
    
    # Return logits for start and end
    return tf.keras.models.Model(
        inputs=[input_ids, attention_mask],
        outputs=(out.start_logits, out.end_logits)
    )

model_v2 = tf_wrap_model(hf_model)

# 3. Compile
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)
model_v2.compile(optimizer=optimizer, loss=loss, metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])

#### **Ask BERT a question**
Once trained, we can perform inference.
1.  **Encode**: Pass the question and context to the tokenizer.
2.  **Predict**: The model returns logits for every token being the start or end.
3.  **Decode**: We take the `argmax` of the logits to find the most likely start and end indices. We then slice the input tokens using these indices and decode them back into a string to get the final answer.

In [None]:
def ask_bert(sample_input, tokenizer, model):
    # Predict logits
    out = model.predict(sample_input)
    
    # Get most likely start and end tokens
    pred_ans_start = tf.argmax(out[0][0])
    pred_ans_end = tf.argmax(out[1][0])
    
    # Extract the answer tokens from the input
    # sample_input[0][0] contains the input_ids
    ans_tokens = sample_input[0][0][pred_ans_start : pred_ans_end + 1]
    
    # Convert ids back to string
    return tokenizer.decode(ans_tokens)

# Example usage
# answer = ask_bert(processed_sample_input, tokenizer, model_v2)
# print(answer)
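
Taking the two `argmax` values independently can occasionally yield an end index that comes before the start index, which produces an empty answer. A common alternative, shown here as a sketch rather than as part of the notebook's method, is to search for the valid pair with the highest combined score. The function name, the `max_answer_len` limit, and the example logits are assumptions.

```python
import numpy as np

def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) with start <= end maximizing the summed logits."""
    best_score, best = -np.inf, (0, 0)
    for s in range(len(start_logits)):
        # Only consider ends at or after the start, within the length limit
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

# Made-up logits where independent argmax gives an invalid span
start_logits = np.array([0.1, 0.2, 3.0, 0.1, 0.0])
end_logits = np.array([0.0, 4.0, 0.3, 2.5, 0.1])

naive = (int(np.argmax(start_logits)), int(np.argmax(end_logits)))
print(naive)                                   # (2, 1): end before start
print(best_span(start_logits, end_logits))     # (2, 3)
```

The returned indices can be used in place of `pred_ans_start` and `pred_ans_end` when slicing the input IDs.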