#### The Transformer architecture
Starting in 2017, a new model architecture began overtaking recurrent neural networks across most natural language processing tasks: the **Transformer**. <br>
Transformers were introduced in the seminal paper “Attention is all you need” by Vaswani et al. The gist of the paper is right there in the title: as it turned out, a simple mechanism called “neural attention” could be used to build powerful sequence models that didn’t feature any recurrent layers or convolution layers. <br>
This finding unleashed nothing short of a revolution in natural language processing and beyond. Neural attention has fast become one of the most influential ideas in deep learning. In this section, you’ll get an in-depth explanation of how it works and why it has proven so effective for sequence data. We’ll then leverage **self-attention** to create a **Transformer encoder**, one of the basic components of the **Transformer** architecture, and we’ll apply it to the IMDB movie review classification task.

##### Understanding self-attention
As you’re going through this book, you may be skimming some parts and attentively reading others, depending on what your goals or interests are. What if your models did the same? It’s a simple yet powerful idea: not all input information seen by a model is equally important to the task at hand, so models should “pay more attention” to some features and “pay less attention” to other features. <br>
Does that sound familiar? You’ve already encountered a similar concept twice in this book:
- **Max pooling** in convnets looks at a pool of features in a spatial region and selects just one feature to keep. That’s an “all or nothing” form of attention: keep the most important feature and discard the rest.
- **TF-IDF** normalization assigns importance scores to tokens based on how much information different tokens are likely to carry. Important tokens get boosted while irrelevant tokens get faded out. That’s a continuous form of attention.

There are many different forms of attention you could imagine, but they all start by computing importance scores for a set of features, with higher scores for more relevant features and lower scores for less relevant ones (see figure 11.5). How these scores should be computed, and what you should do with them, will vary from approach to approach.
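
To make the contrast between “all or nothing” and continuous attention concrete, here is a minimal NumPy sketch. The feature vectors, the importance scores, and the function names `hard_attention` and `soft_attention` are all made up for illustration; real max pooling and TF-IDF compute their scores differently.

```python
import numpy as np

def hard_attention(features, scores):
    """Keep only the highest-scoring feature (all or nothing, in the spirit of max pooling)."""
    return features[np.argmax(scores)]

def soft_attention(features, scores):
    """Return the sum of all features, weighted by normalized importance scores."""
    weights = scores / scores.sum()
    return weights @ features

# Three toy feature vectors (one per row) and hypothetical importance scores.
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])
scores = np.array([0.1, 0.2, 0.7])

hard = hard_attention(features, scores)  # Only the third feature survives.
soft = soft_attention(features, scores)  # Every feature contributes a little.
print(hard, soft)
```

Both functions start from the same scores; they only differ in what they do with them.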

![](./images/11.5.png)

Crucially, this kind of attention mechanism can be used for more than just highlighting or erasing certain features. It can be used to make features context-aware. You’ve just learned about word embeddings: vector spaces that capture the “shape” of the semantic relationships between different words. In an embedding space, a single word has a fixed position, that is, a fixed set of relationships with every other word in the space. But that’s not quite how language works: the meaning of a word is usually context-specific. When you mark the date, you’re not talking about the same “date” as when you go on a date, nor is it the kind of date you’d buy at the market. When you say, “I’ll see you soon,” the meaning of the word “see” is subtly different from the “see” in “I’ll see this project to its end” or “I see what you mean.” And, of course, the meaning of pronouns like “he,” “it,” etc., is entirely sentence-specific and can even change multiple times within a single sentence. <br>
Clearly, a smart embedding space would provide a different vector representation for a word depending on the other words surrounding it. That’s where self-attention comes in. The purpose of self-attention is to modulate the representation of a token by using the representations of related tokens in the sequence. This produces context-aware token representations. Consider an example sentence: “The train left the station on time.” Now, consider one word in the sentence: station. What kind of station are we talking about? Could it be a radio station? Maybe the International Space Station? <br>
Let’s figure it out algorithmically via **self-attention** (see figure 11.6).

![](./images/11.6.png)

- Step 1 is to compute relevancy scores between the vector for “station” and every other word in the sentence. 
  - These are our “**attention scores**.” We’re simply going to use the dot product between two word vectors as a measure of the strength of their relationship. 
  - It’s a very computationally efficient similarity measure, and it was already the standard way to relate two word embeddings to each other long before Transformers. 
  - In practice, these scores will also go through a scaling function and a softmax, but for now, that’s just an implementation detail.
- Step 2 is to compute the sum of all word vectors in the sentence, weighted by our relevancy scores. 
  - Words closely related to “station” will contribute more to the sum (including the word “station” itself), while irrelevant words will contribute almost nothing. 
  - The resulting vector is our new representation for “station”: a representation that incorporates the surrounding context. 
  - In particular, it includes part of the “train” vector, clarifying that it is, in fact, a “train station.”
- You’d repeat this process for every word in the sentence, producing a new sequence of vectors encoding the sentence. 

Let’s see it in NumPy-like pseudocode:

```python
import numpy as np

def softmax(x):
  # Subtract the max before exponentiating, for numerical stability.
  e = np.exp(x - np.max(x))
  return e / e.sum()

def self_attention(input_sequence):
  # input_sequence has shape (sequence_length, embedding_dim).
  output = np.zeros(shape=input_sequence.shape)
  # Iterate over each token in the input sequence.
  for i, pivot_vector in enumerate(input_sequence):
    scores = np.zeros(shape=(len(input_sequence),))
    for j, vector in enumerate(input_sequence):
      # Compute the dot product (attention score) between the pivot token
      # and every token in the sequence, itself included.
      scores[j] = np.dot(pivot_vector, vector)
    # Scale by a normalization factor, and apply a softmax.
    scores /= np.sqrt(input_sequence.shape[1])
    scores = softmax(scores)
    new_pivot_representation = np.zeros(shape=pivot_vector.shape)
    for j, vector in enumerate(input_sequence):
      # Take the sum of all tokens weighted by the attention scores.
      new_pivot_representation += vector * scores[j]
    output[i] = new_pivot_representation # That sum is our output.
  return output
```
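
The nested loops above can be collapsed into two matrix products. The following is a minimal sketch of that idea, not the Keras implementation; the function name `self_attention_vectorized` and the random toy inputs are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating, for numerical stability.
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def self_attention_vectorized(input_sequence):
    # input_sequence has shape (sequence_length, embedding_dim).
    embedding_dim = input_sequence.shape[1]
    # All pairwise dot products at once: shape (sequence_length, sequence_length).
    scores = input_sequence @ input_sequence.T / np.sqrt(embedding_dim)
    weights = softmax(scores, axis=-1)  # Each row sums to 1.
    # Row i of the result is the attention-weighted sum of all token vectors for token i.
    return weights @ input_sequence

rng = np.random.default_rng(0)
tokens = rng.normal(size=(7, 16))  # 7 tokens with 16-dimensional embeddings
contextual = self_attention_vectorized(tokens)
print(contextual.shape)  # (7, 16)
```

The output has the same shape as the input: one context-aware vector per token.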

Of course, in practice you’d use a vectorized implementation. Keras has a built-in layer to handle it: the MultiHeadAttention layer. Here’s how you would use it:

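```python
from tensorflow.keras.layers import MultiHeadAttention

num_heads = 4
embed_dim = 256
mha_layer = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
# `inputs` is a batch of token vectors with shape (batch_size, sequence_length, embed_dim).
outputs = mha_layer(inputs, inputs, inputs)
```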

Reading this, you’re probably wondering:
- Why are we passing the inputs to the layer three times? That seems redundant.
- What are these “multiple heads” we’re referring to? That sounds intimidating—do they also grow back if you cut them?

Both of these questions have simple answers. Let’s take a look.

##### Generalized self-attention: the query-key-value model
So far, we have only considered one input sequence. However, the **Transformer** architecture was originally developed for **machine translation**, where you have to deal with two input sequences: the source sequence you’re currently translating (such as “How’s the weather today?”), and the target sequence you’re converting it to (such as “¿Qué tiempo hace hoy?”). A **Transformer** is a **sequence-to-sequence model**: it was designed to convert one sequence into another. You’ll learn about sequence-to-sequence models in depth later in this chapter.

Now let’s take a step back. The **self-attention** mechanism as we’ve introduced it performs the following, schematically:

![](./images/self_attention.png)

This means “for each token in inputs (A), compute how much the token is related to every token in inputs (B), and use these scores to weight a sum of tokens from inputs (C).” Crucially, there’s nothing that requires A, B, and C to refer to the same input sequence. In the general case, you could be doing this with three different sequences. We’ll call them “query,” “keys,” and “values.” The operation becomes “for each element in the query, compute how much the element is related to every key, and use these scores to weight a sum of values”:

```python
outputs = sum(values * pairwise_scores(query, keys))
```
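
As a minimal NumPy sketch of this one-liner, the function below scores each query element against every key and uses the softmaxed scores to weight a sum of values. The name `attention`, the scaling by the square root of the key dimension, and the toy `source` and `target` arrays are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(query, keys, values):
    # query: (num_queries, dim); keys: (num_items, dim); values: (num_items, value_dim)
    scores = query @ keys.T / np.sqrt(keys.shape[-1])
    weights = softmax(scores, axis=-1)  # (num_queries, num_items)
    return weights @ values             # (num_queries, value_dim)

rng = np.random.default_rng(0)
target = rng.normal(size=(4, 8))  # For example, 4 target tokens
source = rng.normal(size=(6, 8))  # For example, 6 source tokens

# Two different sequences: the query comes from one, keys and values from the other.
cross = attention(query=target, keys=source, values=source)
# One sequence in all three roles: this is plain self-attention.
self_attn = attention(query=source, keys=source, values=source)
print(cross.shape, self_attn.shape)  # (4, 8) (6, 8)
```

Note that the output always has one row per query element, regardless of how many keys and values there are.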

This terminology comes from search engines and recommender systems (see figure 11.7). Imagine that you’re typing up a query to retrieve a photo from your collection— “dogs on the beach.” Internally, each of your pictures in the database is described by a set of keywords—“cat,” “dog,” “party,” etc. We’ll call those “**keys**.” The search engine will start by comparing your query to the keys in the database. “Dog” yields a match of 1, and “cat” yields a match of 0. It will then rank those keys by strength of match—relevance—and it will return the pictures associated with the top N matches, in order of relevance. <br>
Conceptually, this is what Transformer-style attention is doing. You’ve got a reference sequence that describes something you’re looking for: the **query**. You’ve got a body of knowledge that you’re trying to extract information from: the **values**. Each value is assigned a key that describes the value in a format that can be readily compared to a query. You simply match the query to the keys. Then you return a weighted sum of values. <br>

![](./images/11.7.png)

In practice, the keys and the values are often the same sequence. In machine translation, for instance, the query would be the target sequence, and the source sequence would play the roles of both keys and values: for each element of the target (like “tiempo”), you want to go back to the source (“How’s the weather today?”) and identify the different bits that are related to it (“tiempo” and “weather” should have a strong match). And naturally, if you’re just doing sequence classification, then query, keys, and values are all the same: **you’re comparing a sequence to itself**, to enrich each token with context from the whole sequence. <br>
That explains why we needed to pass inputs three times to our **MultiHeadAttention** layer. But why “multi-head” attention?

##### Multi-head attention

“Multi-head attention” is an extra tweak to the self-attention mechanism, introduced in “Attention is all you need.” The “multi-head” moniker refers to the fact that the output space of the self-attention layer gets factored into a set of independent subspaces, learned separately: for each subspace, the initial query, key, and value are sent through their own set of three dense projections, resulting in three projected vectors. These are processed via neural attention, and the outputs of all subspaces are concatenated back together into a single output sequence. Each such subspace is called a “head.” The full picture is shown in figure 11.8. <br>
The presence of the learnable dense projections enables the layer to actually learn something, as opposed to being a purely stateless transformation that would require additional layers before or after it to be useful. In addition, having independent heads helps the layer learn different groups of features for each token, where features within one group are correlated with each other but are mostly independent from features in a different group.
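
The following NumPy sketch shows one way the per-head projections and the concatenation could fit together. It is a simplification under stated assumptions: the projection matrices are random stand-ins for learned weights, biases are omitted, and the final output projection `w_out` (which mixes the concatenated heads) is included as an assumption about typical implementations rather than something described above.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_attention(query, key, value, heads, w_out):
    head_outputs = []
    for w_q, w_k, w_v in heads:
        # Each head projects query, key, and value into its own subspace.
        q, k, v = query @ w_q, key @ w_k, value @ w_v
        weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
        head_outputs.append(weights @ v)
    # Concatenate the heads and mix them with a final projection.
    return np.concatenate(head_outputs, axis=-1) @ w_out

rng = np.random.default_rng(0)
seq_len, embed_dim, num_heads, head_dim = 5, 16, 4, 4
# Randomly initialized stand-ins for weights that would normally be learned.
heads = [tuple(rng.normal(size=(embed_dim, head_dim)) for _ in range(3))
         for _ in range(num_heads)]
w_out = rng.normal(size=(num_heads * head_dim, embed_dim))

x = rng.normal(size=(seq_len, embed_dim))
out = multi_head_attention(x, x, x, heads, w_out)
print(out.shape)  # (5, 16)
```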

![](./images/11.8.png)

This is similar in principle to what makes depthwise separable convolutions work: in a depthwise separable convolution, the output space of the convolution is factored into many subspaces (one per input channel) that get learned independently. The “Attention is all you need” paper was written at a time when the idea of factoring feature spaces into independent subspaces had been shown to provide great benefits for computer vision models—both in the case of depthwise separable convolutions, and in the case of a closely related approach, grouped convolutions. Multi-head attention is simply the application of the same idea to self-attention.

##### The Transformer encoder
If adding **extra dense projections** is so useful, why don’t we also apply one or two to the **output of the attention mechanism**? Actually, that’s a great idea—let’s do that. And our model is starting to do a lot, so we might want to add **residual connections** to make sure we don’t destroy any valuable information along the way—you learned in chapter 9 that they’re a must for any sufficiently deep architecture. And there’s another thing you learned in chapter 9: **normalization layers** are supposed to help gradients flow better during backpropagation. Let’s add those too. <br>
That’s roughly the thought process that I imagine unfolded in the minds of the inventors of the **Transformer** architecture at the time. Factoring outputs into multiple independent spaces, adding residual connections, adding normalization layers—all of these are standard architecture patterns that one would be wise to leverage in any complex model. Together, these bells and whistles form the **Transformer encoder**—one of two critical parts that make up the **Transformer** architecture (see figure 11.9).

![](./images/11.9.png)

The original **Transformer** architecture consists of two parts: 
- a **Transformer encoder** that processes the source sequence, and 
- a **Transformer decoder** that uses the source sequence to generate a translated version. 

You’ll learn about the **decoder** part in a minute. <br>
Crucially, the **encoder** part can be used for **text classification**—it’s a very generic module that ingests a sequence and learns to turn it into a more useful representation. <br>
Let’s implement a **Transformer encoder** and try it on the movie review sentiment classification task.

##### Getting the data

```python
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz
!rm -r aclImdb/train/unsup
```

##### Preparing the data

```python
import os, pathlib, shutil, random
from tensorflow import keras
batch_size = 32
base_dir = pathlib.Path("aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
for category in ("neg", "pos"):
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    random.Random(1337).shuffle(files)
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)

train_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train", batch_size=batch_size
)
val_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/val", batch_size=batch_size
)
test_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size
)
text_only_train_ds = train_ds.map(lambda x, y: x)
```
##### Vectorizing the data

```python
from tensorflow.keras import layers

max_length = 600
max_tokens = 20000
text_vectorization = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,
)
text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
int_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
int_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
```
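
To give a feel for what integer vectorization produces, here is a simplified pure-Python sketch. It is not how `TextVectorization` is implemented: the helper names `build_vocabulary` and `vectorize` are hypothetical, punctuation stripping is skipped, and the only conventions borrowed from the layer are that index 0 is reserved for padding and index 1 for out-of-vocabulary tokens.

```python
from collections import Counter

def build_vocabulary(texts, max_tokens):
    # Index 0 is reserved for padding and index 1 for out-of-vocabulary tokens.
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: index + 2 for index, word in enumerate(most_common)}

def vectorize(text, vocabulary, sequence_length):
    indices = [vocabulary.get(word, 1) for word in text.lower().split()]
    indices = indices[:sequence_length]                      # Truncate long inputs.
    return indices + [0] * (sequence_length - len(indices))  # Pad short inputs with 0.

corpus = ["the movie was great", "the movie was terrible"]
vocabulary = build_vocabulary(corpus, max_tokens=6)
vector = vectorize("the movie was boring", vocabulary, sequence_length=6)
print(vector)
```

The unknown word “boring” maps to index 1, and the trailing zeros are the padding that the attention mask will later hide.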

##### Transformer encoder implemented as a subclassed Layer

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim # Size of the input token vectors
        self.dense_dim = dense_dim # Size of the inner dense layer
        self.num_heads = num_heads # Number of attention heads
        self.attention = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"),
             layers.Dense(embed_dim),]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()

    def call(self, inputs, mask=None): # Computation goes in call().
        # The mask generated by the Embedding layer will be 2D, but the
        # attention layer expects it to be 3D or 4D, so we expand its rank.
        if mask is not None:
            mask = mask[:, tf.newaxis, :]
        attention_output = self.attention(inputs, inputs, attention_mask=mask)
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)

    # Implement serialization so we can save the model.
    def get_config(self):
        config = super().get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim": self.dense_dim,
        })
        return config
```
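
The rank expansion of the mask in `call()` is easier to see on a small array. The NumPy sketch below is only an illustration of the shapes involved: the toy `token_ids` are made up, and replacing masked scores with a large negative number is an assumption about how attention masking is commonly done, not a description of the Keras internals.

```python
import numpy as np

# A batch of 2 padded sequences of token indices; 0 marks padding.
token_ids = np.array([[5, 8, 2, 0, 0],
                      [7, 3, 0, 0, 0]])

padding_mask = token_ids != 0                    # Shape (2, 5): True for real tokens.
attention_mask = padding_mask[:, np.newaxis, :]  # Shape (2, 1, 5).

# Broadcasting against (batch, query_length, key_length) scores hides the
# padded keys from every query position.
scores = np.zeros((2, 5, 5))
masked_scores = np.where(attention_mask, scores, -1e9)
print(attention_mask.shape, masked_scores.shape)  # (2, 1, 5) (2, 5, 5)
```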

You’ll note that the normalization layers we’re using here aren’t **BatchNormalization** layers like those we’ve used before in image models. That’s because **BatchNormalization** doesn’t work well for sequence data. Instead, we’re using the **LayerNormalization** layer, which normalizes each sequence independently from other sequences in the batch.<br>
Like this, in NumPy-like pseudocode:
```python
import numpy as np

def layer_normalization(batch_of_sequences, epsilon=1e-6): # Input shape (batch_size, sequence_length, embedding_dim)
    # To compute mean and variance, we only pool data over the last axis (axis -1).
    mean = np.mean(batch_of_sequences, keepdims=True, axis=-1)
    variance = np.var(batch_of_sequences, keepdims=True, axis=-1)
    # Divide by the standard deviation; epsilon guards against division by zero.
    return (batch_of_sequences - mean) / np.sqrt(variance + epsilon)
```
Compare to BatchNormalization (during training):
```python
import numpy as np

def batch_normalization(batch_of_images, epsilon=1e-6): # Input shape: (batch_size, height, width, channels)
    # Pool data over the batch axis (axis 0) and the spatial axes, which creates interactions between samples in a batch.
    mean = np.mean(batch_of_images, keepdims=True, axis=(0, 1, 2))
    variance = np.var(batch_of_images, keepdims=True, axis=(0, 1, 2))
    # Divide by the standard deviation; epsilon guards against division by zero.
    return (batch_of_images - mean) / np.sqrt(variance + epsilon)
```
While **BatchNormalization** collects information from many samples to obtain accurate statistics for the feature means and variances, **LayerNormalization** pools data within each sequence separately, which is more appropriate for sequence data. <br>
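
A quick numerical check can make the “each sequence separately” point tangible. This is a minimal sketch on random data: it divides by the standard deviation with a small `epsilon` (an assumption added for numerical stability) and leaves out the learnable scale and offset that the real Keras layer also applies.

```python
import numpy as np

def layer_normalization(x, epsilon=1e-6):
    mean = np.mean(x, axis=-1, keepdims=True)
    variance = np.var(x, axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(variance + epsilon)

rng = np.random.default_rng(0)
batch = rng.normal(loc=3.0, scale=2.0, size=(2, 4, 8))  # (batch, sequence, embedding)

normalized = layer_normalization(batch)
# Every token vector now has roughly zero mean and unit variance on its own.
zero_mean = np.allclose(normalized.mean(axis=-1), 0.0, atol=1e-6)
unit_variance = np.allclose(normalized.var(axis=-1), 1.0, atol=1e-3)

# Normalizing one sample alone gives the same result as normalizing it within the batch.
alone = layer_normalization(batch[:1])
independent_of_batch = np.allclose(alone, normalized[:1])
print(zero_mean, unit_variance, independent_of_batch)
```

The last check is the property that batch normalization does not have: there, changing the other samples in the batch changes the statistics applied to this one.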
Now that we’ve implemented our **TransformerEncoder**, we can use it to assemble a text-classification model similar to the GRU-based one you’ve seen previously.

##### Using the Transformer encoder for text classification

```python
vocab_size = 20000
embed_dim = 256
num_heads = 2
dense_dim = 32

inputs = keras.Input(shape=(None,), dtype="int64")
x = layers.Embedding(vocab_size, embed_dim)(inputs)
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)
# Since TransformerEncoder returns full sequences, we need to reduce each
# sequence to a single vector for classification via a global pooling layer.
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)

model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```

##### Training and evaluating the Transformer encoder based model

```python
callbacks = [
    keras.callbacks.ModelCheckpoint("transformer_encoder.keras",
                                    save_best_only=True)
]

model.fit(int_train_ds, validation_data=int_val_ds, epochs=20, callbacks=callbacks)

# Provide the custom TransformerEncoder class to the model-loading process.
model = keras.models.load_model(
    "transformer_encoder.keras",
    custom_objects={"TransformerEncoder": TransformerEncoder})
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")
```

It gets to 88.3% test accuracy—slightly worse than the GRU model.

At this point, you should start to feel a bit uneasy. Something’s off here. Can you tell what it is? <br>
This section is ostensibly about “sequence models.” I started off by highlighting the importance of word order. I said that **Transformer** was a sequence-processing architecture, originally developed for machine translation. And yet . . . the **Transformer encoder** you just saw in action wasn’t a sequence model at all. Did you notice? It’s composed of dense layers that process sequence tokens independently from each other, and an attention layer that looks at the tokens as a set. You could change the order of the tokens in a sequence, and you’d get the exact same pairwise attention scores and the exact same context-aware representations. If you were to completely scramble the words in every movie review, the model wouldn’t notice, and you’d still get the exact same accuracy. Self-attention is a set-processing mechanism, focused on the relationships between pairs of sequence elements (see figure 11.10)—it’s blind to whether these elements occur at the beginning, at the end, or in the middle of a sequence. So why do we say that Transformer is a sequence model? And how could it possibly be good for machine translation if it doesn’t look at word order?
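
The order-blindness described above can be checked directly on a toy example. The sketch below uses a bare NumPy self-attention (no learned projections, no dense layers), which is an assumption made to keep the example short; the same argument applies to the full encoder because its dense layers act on each token independently.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def self_attention(x):
    weights = softmax(x @ x.T / np.sqrt(x.shape[-1]), axis=-1)
    return weights @ x

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))  # 6 tokens with 8-dimensional embeddings
permutation = rng.permutation(6)  # A random reordering of the tokens

original = self_attention(tokens)
shuffled = self_attention(tokens[permutation])

# Shuffling the inputs only shuffles the outputs in the same way...
same_up_to_order = np.allclose(shuffled, original[permutation])
# ...so an order-insensitive pooling step sees no difference at all.
same_after_pooling = np.allclose(shuffled.max(axis=0), original.max(axis=0))
print(same_up_to_order, same_after_pooling)  # True True
```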

![](./images/11.10.png)

I hinted at the solution earlier in the chapter: I mentioned in passing that **Transformer** was a hybrid approach that is technically order-agnostic, but that manually injects order information in the representations it processes. This is the missing ingredient! <br>
It’s called **positional encoding**. Let’s take a look.

##### Using positional encoding to re-inject order information
The idea behind positional encoding is very simple: to give the model access to word order information, we’re going to add the word’s position in the sentence to each word embedding. Our input word embeddings will have two components: the usual word vector, which represents the word independently of any specific context, and a position vector, which represents the position of the word in the current sentence. Hopefully, the model will then figure out how to best leverage this additional information. <br>
The simplest scheme you could come up with would be to concatenate the word’s position to its embedding vector. You’d add a “position” axis to the vector and fill it with 0 for the first word in the sequence, 1 for the second, and so on. <br>
That may not be ideal, however, because the positions can potentially be very large integers, which will disrupt the range of values in the embedding vector. As you know, neural networks don’t like very large input values, or discrete input distributions. <br>
The original “Attention is all you need” paper used an interesting trick to encode word positions: it added to the word embeddings a vector containing values in the range [-1, 1] that varied cyclically depending on the position (it used sine and cosine functions to achieve this). This trick offers a way to uniquely characterize any integer in a large range via a vector of small values. It’s clever, but it’s not what we’re going to use in our case. We’ll do something simpler and more effective: we’ll learn position embedding vectors the same way we learn to embed word indices. We’ll then proceed to add our position embeddings to the corresponding word embeddings, to obtain a position-aware word embedding. This technique is called “positional embedding.” <br>
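
As an aside, the cyclical scheme from the original paper can be sketched in a few lines of NumPy. The function name `sinusoidal_position_encoding` is made up here, and the sketch assumes an even `embed_dim`; it alternates sine and cosine across dimensions, with frequencies that decrease geometrically.

```python
import numpy as np

def sinusoidal_position_encoding(sequence_length, embed_dim):
    positions = np.arange(sequence_length)[:, np.newaxis]  # (sequence_length, 1)
    # One frequency per pair of dimensions, decreasing geometrically.
    frequencies = 1.0 / 10000 ** (np.arange(0, embed_dim, 2) / embed_dim)
    angles = positions * frequencies                        # (sequence_length, embed_dim / 2)
    encoding = np.zeros((sequence_length, embed_dim))
    encoding[:, 0::2] = np.sin(angles)  # Even dimensions use sine.
    encoding[:, 1::2] = np.cos(angles)  # Odd dimensions use cosine.
    return encoding

encoding = sinusoidal_position_encoding(sequence_length=600, embed_dim=256)
print(encoding.shape)  # (600, 256)
```

Every value stays within [-1, 1] no matter how large the position gets. The learned positional embedding used in this section replaces this fixed table with trainable vectors.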
Let’s implement it.

##### Implementing positional embedding as a subclassed layer

```python
class PositionalEmbedding(layers.Layer):
    # A downside of position embeddings is that the sequence length needs to be known in advance.
    def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
        super().__init__(**kwargs)
        # Prepare an Embedding layer for the token indices.
        self.token_embeddings = layers.Embedding(
            input_dim=input_dim, output_dim=output_dim)
        # And another one for the token positions.
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=output_dim)
        self.sequence_length = sequence_length
        self.input_dim = input_dim
        self.output_dim = output_dim

    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        # Add both embedding vectors together.
        return embedded_tokens + embedded_positions

    # Like the Embedding layer, this layer should be able to generate a mask so we can ignore padding 0s in the inputs.
    # The compute_mask method will be called automatically by the framework, and the mask will get propagated to the next layer.
    def compute_mask(self, inputs, mask=None):
        return tf.math.not_equal(inputs, 0)

    # Implement serialization so we can save the model.
    def get_config(self):
        config = super().get_config()
        config.update({
            "output_dim": self.output_dim,
            "sequence_length": self.sequence_length,
            "input_dim": self.input_dim,
        })
        return config
```
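
Stripped of the Keras machinery, the layer performs two table lookups, an addition, and a comparison against 0. Here is a minimal NumPy sketch of that computation; the tiny sizes and the random `token_table` and `position_table` are stand-ins for embedding weights that the real layer would learn.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, sequence_length, embed_dim = 10, 5, 4
# Random stand-ins for the two embedding tables that the layer would learn.
token_table = rng.normal(size=(vocab_size, embed_dim))
position_table = rng.normal(size=(sequence_length, embed_dim))

token_ids = np.array([[3, 7, 7, 0, 0]])     # One padded sequence; 0 is padding.
positions = np.arange(token_ids.shape[-1])  # [0, 1, 2, 3, 4]

# Look up both tables and add the results: shape (1, 5, 4).
embedded = token_table[token_ids] + position_table[positions]
# True for real tokens, False for padding: shape (1, 5).
mask = token_ids != 0

# The same token (7) at positions 1 and 2 now gets two different vectors.
print(np.allclose(embedded[0, 1], embedded[0, 2]))  # False
print(mask)
```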

You would use this **PositionalEmbedding** layer just like a regular Embedding layer. Let’s see it in action!

##### Putting it all together: a text-classification Transformer
All you have to do to start taking word order into account is swap the old Embedding layer with our position-aware version.

##### Combining the Transformer encoder with positional embedding

```python
vocab_size = 20000
sequence_length = 600
embed_dim = 256
num_heads = 2
dense_dim = 32

inputs = keras.Input(shape=(None,), dtype="int64")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(inputs)
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("full_transformer_encoder.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=20, callbacks=callbacks)
model = keras.models.load_model(
    "full_transformer_encoder.keras",
    custom_objects={"TransformerEncoder": TransformerEncoder,
                    "PositionalEmbedding": PositionalEmbedding})
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")
```

##### When to use sequence models over bag-of-words models
You may sometimes hear that bag-of-words methods are outdated, and that Transformer-based sequence models are the way to go, no matter what task or dataset you’re looking at. This is definitely not the case: a small stack of Dense layers on top of a bag-of-bigrams remains a perfectly valid and relevant approach in many cases. In fact, among the various techniques that we’ve tried on the IMDB dataset throughout this chapter, the best performing so far was the bag-of-bigrams! <br>
So, when should you prefer one approach over the other? <br>
In 2017, my team and I ran a systematic analysis of the performance of various text classification techniques across many different types of text datasets, and we discovered a remarkable and surprising rule of thumb for deciding whether to go with a bag-of-words model or a sequence model (http://mng.bz/AOzK)—a golden constant of sorts. <br>
It turns out that when approaching a new text-classification task, you should pay close attention to the ratio between the number of samples in your training data and the mean number of words per sample (see figure 11.11). If that ratio is small—less than 1,500—then the bag-of-bigrams model will perform better (and as a bonus, it will be much faster to train and to iterate on too). If that ratio is higher than 1,500, then you should go with a sequence model. In other words, sequence models work best when lots of training data is available and when each sample is relatively short.

![](./images/11.11.png)

So if you’re classifying 1,000-word long documents, and you have 100,000 of them (a ratio of 100), you should go with a bigram model. If you’re classifying tweets that are 40 words long on average, and you have 50,000 of them (a ratio of 1,250), you should also go with a bigram model. But if you increase your dataset size to 500,000 tweets (a ratio of 12,500), go with a Transformer encoder. What about the IMDB movie review classification task? We had 20,000 training samples and an average word count of 233, so our rule of thumb points toward a bigram model, which confirms what we found in practice. <br>
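
The arithmetic behind these examples fits in a tiny helper. The function name `recommend_model` is hypothetical, and treating a ratio of exactly 1,500 as a bag-of-bigrams case is an arbitrary choice, since the rule of thumb only covers ratios below or above that value.

```python
def recommend_model(num_samples, mean_words_per_sample, threshold=1500):
    # Ratio between the number of training samples and the mean sample length.
    ratio = num_samples / mean_words_per_sample
    choice = "sequence model" if ratio > threshold else "bag-of-bigrams"
    return ratio, choice

examples = [
    ("long documents", 100_000, 1_000),
    ("tweets", 50_000, 40),
    ("more tweets", 500_000, 40),
    ("IMDB", 20_000, 233),
]
for name, num_samples, mean_words in examples:
    ratio, choice = recommend_model(num_samples, mean_words)
    print(f"{name}: ratio {ratio:.0f} -> {choice}")
```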
This intuitively makes sense: the input of a sequence model represents a richer and more complex space, and thus it takes more data to map out that space; meanwhile, a plain set of terms is a space so simple that you can train a logistic regression on top using just a few hundred or a few thousand samples. In addition, the shorter a sample is, the less the model can afford to discard any of the information it contains. In particular, word order becomes more important, and discarding it can create ambiguity. The sentences “this movie is the bomb” and “this movie was a bomb” have very close unigram representations, which could confuse a bag-of-words model, but a sequence model could tell which one is negative and which one is positive. With a longer sample, word statistics would become more reliable and the topic or sentiment would be more apparent from the word histogram alone. <br>
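
The two example sentences can be compared as unigram bags with a few lines of standard-library Python. This is only an illustration of how much the two bags overlap; the helper name `unigram_counts` is made up, and tokenization is a plain whitespace split.

```python
from collections import Counter

def unigram_counts(sentence):
    # A bag of words: word counts with all order information discarded.
    return Counter(sentence.lower().split())

positive = unigram_counts("this movie is the bomb")
negative = unigram_counts("this movie was a bomb")

shared = positive & negative         # Words the two sentences have in common.
only_positive = positive - negative  # Words unique to the first sentence.
only_negative = negative - positive  # Words unique to the second sentence.

print(sorted(shared))         # ['bomb', 'movie', 'this']
print(sorted(only_positive))  # ['is', 'the']
print(sorted(only_negative))  # ['a', 'was']
```

The only words that differ are function words that carry little sentiment on their own, which is why the distinction is hard to recover without word order.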
Now, keep in mind that this heuristic rule was developed specifically for text classification. It may not necessarily hold for other NLP tasks—when it comes to machine translation, for instance, Transformer shines especially for very long sequences, compared to RNNs. Our heuristic is also just a rule of thumb, rather than a scientific law, so expect it to work most of the time, but not necessarily every time.