In [1]:
import tensorflow as tf

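This code, last updated by OpenAI in 2019, uses TensorFlow 1.x.
Operations and modules ```removed``` from the top-level API in TensorFlow 2.x:
- tf.variable_scope (still reachable as tf.compat.v1.variable_scope)
- tf.get_variable (still reachable as tf.compat.v1.get_variable)
- tf.rsqrt (moved to tf.math.rsqrt)
- tf.contrib (dropped entirely)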

People [suggest](https://stackoverflow.com/questions/63350105/how-to-alter-gpt-2-code-to-work-with-tensorflow-2-0) not going beyond TensorFlow 1.15 with this repo, because migrating from 1.x to 2.x is non-trivial.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained model and tokenizer
model_name = "gpt2"  # You can choose from different sizes: gpt2, gpt2-medium, gpt2-large, gpt2-xl
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Encode input text
input_text = "Once upon a time"
input_ids = tokenizer.encode(input_text, return_tensors='pt')

# Generate text
output = model.generate(input_ids, max_length=50, num_return_sequences=1)

# Decode and print the result
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)

```

Let's dig into the Hugging Face ```transformers``` library (idea from ChatGPT):
- [Main hole](https://github.com/huggingface/transformers) to dig
- Read the README. For example, it discusses the benefits of using HF transformers and mentions "Move a single model between TF2.0/PyTorch/JAX frameworks at will". What is this much-repeated JAX, and why haven't we dug into it yet? [Flax](https://flax.readthedocs.io/en/latest/): neural networks with JAX

- [Models](https://github.com/huggingface/transformers/tree/main/src/transformers/models) --> [gpt2](https://github.com/huggingface/transformers/tree/main/src/transformers/models/gpt2) --> [TF 2.0](https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt2/modeling_tf_gpt2.py) and [PyTorch](https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt2/modeling_gpt2.py) implementations.
- gpt2 -> convert_gpt2_original_tf_checkpoint_to_pytorch.py:
script for converting a TensorFlow GPT-2 checkpoint to PyTorch.
-- ```parser = argparse.ArgumentParser()```
creates an ArgumentParser object that handles the script's command-line arguments.
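
To make the ArgumentParser step concrete, here is a minimal sketch of how such a conversion script could declare and read its arguments. The flag names and paths below are illustrative assumptions, not copied from the Hugging Face script.

```python
import argparse


def build_parser():
    # NOTE: flag names are hypothetical; check the real script for its actual flags.
    parser = argparse.ArgumentParser(
        description="Convert a TensorFlow checkpoint to PyTorch (sketch)"
    )
    parser.add_argument("--checkpoint_path", type=str, required=True,
                        help="Path to the TensorFlow checkpoint.")
    parser.add_argument("--dump_folder_path", type=str, required=True,
                        help="Folder that receives the converted model.")
    parser.add_argument("--config_file", type=str, default="",
                        help="Optional model config; empty means use defaults.")
    return parser


# Passing an explicit list to parse_args() avoids reading sys.argv,
# which is convenient inside a notebook.
args = build_parser().parse_args(
    ["--checkpoint_path", "models/124M", "--dump_folder_path", "out"]
)
print(args.checkpoint_path, args.dump_folder_path, repr(args.config_file))
```

Each `add_argument` call becomes an attribute on the returned namespace, so the rest of a script only ever touches `args.<name>`.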

Let's dive deep into the TF implementation. Interestingly, the ```causal_attention_mask``` [function](https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt2/modeling_tf_gpt2.py#:~:text=causal_attention_mask) is copied from the OpenAI GPT-2 source code, which we study below. I [dug](https://stackoverflow.com/questions/18220650/github-link-to-function-in-source) around for a way to link to a function definition rather than to line numbers, because line numbers shift as files change.

```python
def conv1d(x, scope, nf, *, w_init_stdev=0.02):
    with tf.variable_scope(scope):
        *start, nx = shape_list(x)
        w = tf.get_variable('w', [1, nx, nf], initializer=tf.random_normal_initializer(stddev=w_init_stdev))
        b = tf.get_variable('b', [nf], initializer=tf.constant_initializer(0))
        c = tf.reshape(tf.matmul(tf.reshape(x, [-1, nx]), tf.reshape(w, [-1, nf]))+b, start+[nf])
        return c
```
In a standard 1D convolution, a filter slides along the input with a given stride and computes local dot products. This function does something simpler: the kernel width is 1 (note the weight shape [1, nx, nf]), so it acts as a fully connected layer over the last dimension of x, applied independently at every position.

If you wanted a true 1D convolution, you'd typically use a function like tf.nn.conv1d or tf.keras.layers.Conv1D.
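
The claim that this `conv1d` behaves like a dense layer can be checked numerically. The sketch below uses NumPy as a stand-in for TensorFlow (an assumption made so it runs without TF 1.x); `conv1d_np` is a hypothetical name that mirrors the reshape, matmul, reshape sequence above.

```python
import numpy as np


def conv1d_np(x, w, b):
    """NumPy stand-in for conv1d: w has shape [1, nx, nf], b has shape [nf]."""
    *start, nx = x.shape
    nf = w.shape[-1]
    flat = x.reshape(-1, nx) @ w.reshape(-1, nf) + b  # [batch*sequence, nf]
    return flat.reshape(*start, nf)


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 8))  # [batch, sequence, nx]
w = rng.normal(size=(1, 8, 3))  # [1, nx, nf]
b = rng.normal(size=(3,))

out = conv1d_np(x, w, b)
dense = x @ w[0] + b  # plain dense layer on the last axis

# Each position only sees its own features: nothing slides over the sequence axis.
per_position = np.stack(
    [x[:, t] @ w[0] + b for t in range(x.shape[1])], axis=1
)

print(out.shape)
print(np.allclose(out, dense), np.allclose(out, per_position))
```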

In [2]:
def bool_attention_mask(nd, ns):
    """1's in the lower triangle, counting from the lower right corner.

    Same as tf.matrix_band_part(tf.ones([nd, ns]), -1, ns-nd), but doesn't produce garbage on TPUs.
    """
    i = tf.range(nd)[:,None]
    j = tf.range(ns)
    m = i >= j - ns + nd  # (i, j) below and on the main diagonal <--> i - nd >= j - ns
    return m
# Parameters after a bare * are keyword-only: they must be passed by name,
# not positionally, and are required unless they have a default.
m = bool_attention_mask(3, 4)
assert m.dtype == tf.bool

# bool_attention_mask itself takes no dtype; the cast lives in this wrapper.
# The positional order is (nd, ns), matching bool_attention_mask and the call in attn().
def attention_mask(nd, ns, *, dtype):
    return tf.cast(bool_attention_mask(nd, ns), dtype)

m = attention_mask(3, 4, dtype=tf.float32)
assert m.dtype == tf.float32 and m.shape == (3, 4)
# attention_mask(3, 4, tf.float32) raises
# TypeError: attention_mask() takes 2 positional arguments but 3 were given
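
To see what "counting from the lower right corner" means, here is a small NumPy sketch of the same comparison (NumPy is an assumption, standing in for TensorFlow; `bool_attention_mask_np` is a hypothetical name). With nd = 3 destination positions and ns = 4 source positions, the three new tokens are the last three of the four, so the triangle is aligned to the bottom-right.

```python
import numpy as np


def bool_attention_mask_np(nd, ns):
    i = np.arange(nd)[:, None]  # destination (query) index, as a column
    j = np.arange(ns)           # source (key) index, as a row
    return i >= j - ns + nd     # broadcasts to shape [nd, ns]


mask = bool_attention_mask_np(3, 4)
print(mask.astype(int))
# [[1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]

# Same result as keeping everything on or below the (ns - nd)-th diagonal.
band = np.tril(np.ones((3, 4)), k=4 - 3).astype(bool)
print(np.array_equal(mask, band))
```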

In [3]:
def split_states(x, n):
    *start, m = x.shape
    return tf.reshape(x, start + [n, m//n])

x = tf.range(12)
assert x.shape == (12,) and x.shape == (12) and type(x.shape) == tf.TensorShape and x.shape.as_list() == [12]
print(type(x.shape))
*start, m = x.shape
assert start == [] and m == 12
# `start, m = x.shape` (without the star) fails on a rank-1 shape:
# ValueError: not enough values to unpack (expected 2, got 1)
x = tf.reshape(x, (2,6))
# x.shape #TensorShape([2,6])
*start, m = x.shape
assert start == [2] and tf.reshape(x, start + [2,3]).shape == [2, 2, 3]
assert tf.reduce_all(split_states(x, 2) == tf.reshape(x, start + [2,3]))
# *a, *b, m = tf.range(12).shape
# SyntaxError: multiple starred expressions in assignment
# Python allows only one starred target per unpacking assignment.

<class 'tensorflow.python.framework.tensor_shape.TensorShape'>

In [4]:
def merge_states(x):
    *start, a, b = x.shape #unpacking shape sequence
    return tf.reshape(x, start + [a*b])
# see split_states() block for unpacking the shape using *start
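
A quick way to check that `split_states` and `merge_states` undo each other, including the (0, 2, 1, 3) transpose that moves the head axis in front of the sequence axis. This is a sketch under the assumption that NumPy reshapes behave like `tf.reshape` here; the `_np` names are hypothetical.

```python
import numpy as np


def split_states_np(x, n):
    *start, m = x.shape
    return x.reshape(*start, n, m // n)


def merge_states_np(x):
    *start, a, b = x.shape
    return x.reshape(*start, a * b)


x = np.arange(2 * 3 * 8).reshape(2, 3, 8)  # [batch, sequence, features]

# [batch, sequence, features] -> [batch, heads, sequence, features per head]
heads = split_states_np(x, 4).transpose(0, 2, 1, 3)

# Undo the transpose first, then merge the last two axes back together.
restored = merge_states_np(heads.transpose(0, 2, 1, 3))

print(heads.shape, restored.shape)
print(np.array_equal(restored, x))
```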

In [10]:
def shape_list(x):
    static = x.shape.as_list()
    dynamic = tf.shape(x)
    return [dynamic[i] if s is None else s for i, s in enumerate(static)]

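x = tf.range(12)
x = tf.reshape(x, [2, 6])
assert x.shape.as_list() == [2, 6]  # static shape: plain Python ints

x = tf.reshape(x, (3,4))
assert tf.shape(x).dtype == tf.int32  # dynamic shape: an int32 tensor
assert shape_list(x) == [3, 4]  # fully static here, so no tensors are needed

# In TensorFlow 1.x an unknown batch size was declared with a placeholder:
# x = tf.placeholder(tf.float32, shape=[None, 10])  # `None` indicates an unknown batch size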
x = tf.keras.Input(shape=(10,))
assert x.shape == (None, 10)
# print(tf.shape(x))
'''
ValueError: A KerasTensor cannot be used as input to a TensorFlow function. A KerasTensor is a symbolic placeholder for a shape and dtype, used when constructing Keras Functional models or Keras Functions. You can only use it as input to a Keras layer or a Keras operation (from the namespaces `keras.layers` and `keras.operations`). You are likely doing something like:

x = Input(...)
...
tf_fn(x)  # Invalid.

What you should do instead is wrap `tf_fn` in a layer:

class MyLayer(Layer):
    def call(self, x):
        return tf_fn(x)

x = MyLayer()(x)

'''

TensorFlow 1.x:

tf.placeholder creates a graph input whose shape may leave dimensions undefined (None). Most often that is the first dimension, the batch size, because it typically varies between runs.

TensorFlow 2.x:

tf.keras.Input is a common way to declare inputs with unknown dimensions when defining Keras models. Outside model definitions, tensors are evaluated eagerly and can be used directly, without placeholders.
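
The selection rule inside `shape_list` can be illustrated without TensorFlow. In the sketch below, plain Python lists stand in for the static and dynamic shapes (an assumption: in the real function the dynamic entries are scalar tensors taken from `tf.shape(x)`), and `merge_static_dynamic` is a hypothetical name.

```python
def merge_static_dynamic(static, dynamic):
    """Prefer the static size; fall back to the runtime size where static is None."""
    return [d if s is None else s for s, d in zip(static, dynamic)]


static_shape = [None, 10]  # known at graph-construction time; batch size unknown
runtime_shape = [32, 10]   # hypothetical shape once a batch of 32 arrives

print(merge_static_dynamic(static_shape, runtime_shape))  # [32, 10]
```

Keeping the static values wherever they exist lets later code do ordinary Python arithmetic on them, and only the unknown dimensions have to be resolved at run time.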

In [None]:
def attn(x, scope, n_state, *, past, hparams):
    assert x.shape.ndims == 3 # [batch, sequence, features] ([B, T, C])
    assert n_state % hparams.n_head == 0
    if past is not None:
        assert past.shape.ndims == 5 # [batch, 2, heads, sequence, features]

    def split_heads(x):
        '''
        [batch, sequence, features] -> [batch, head, sequence, features]
        '''
        return tf.transpose(split_states(x, hparams.n_head), (0, 2, 1, 3))

    def merge_heads(x):
        '''
        [batch, head, sequence, features] -> [batch, sequence, features]
        '''
        return merge_states(tf.transpose(x, (0, 2, 1, 3)))

    def mask_attn_weights(w):
        # w.shape == [batch, heads, dst_sequence, src_sequence]
        _, _, nd, ns = shape_list(w)
        b = attention_mask(nd, ns, dtype=w.dtype)
        b = tf.reshape(b, [1,1,nd,ns])
        w = w*b - tf.cast(1e10, w.dtype)*(1-b) # where b == 0 the score becomes -1e10, so softmax ignores that position
        # * is element-wise multiplication
        return w

    def multihead_attn(q, k, v):
        # q, k, v shapes = [batch, heads, sequence, features]
        w = tf.matmul(q, k, transpose_b=True)
        w = w * tf.rsqrt(tf.cast(v.shape[-1].value, w.dtype))  # TF 1.x API; in TF 2.x: tf.math.rsqrt(tf.cast(v.shape[-1], w.dtype))
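
The masking trick in `mask_attn_weights` is easier to follow with numbers. The sketch below is a single-head NumPy stand-in (an assumption, not the original TensorFlow code; `softmax_np` and the tensor sizes are hypothetical). It shows that `w*b - 1e10*(1-b)` followed by a softmax gives zero weight to every masked position.

```python
import numpy as np


def softmax_np(w, axis=-1):
    w = w - w.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(w)
    return e / e.sum(axis=axis, keepdims=True)


nd, ns, features = 3, 3, 4
rng = np.random.default_rng(0)
q = rng.normal(size=(nd, features))
k = rng.normal(size=(ns, features))

w = q @ k.T / np.sqrt(features)    # scaled scores, shape [nd, ns]
b = np.tril(np.ones((nd, ns)))     # 1 = may attend, 0 = masked (future position)
w_masked = w * b - 1e10 * (1 - b)  # masked scores become -1e10
probs = softmax_np(w_masked)

print(np.round(probs, 3))  # entries above the diagonal are 0
```

Multiplying by `b` first zeroes the masked scores, so the subtraction sets them to exactly -1e10 regardless of their original value.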


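A Stack Overflow [post](https://stackoverflow.com/questions/34192229/efficient-element-wise-multiplication-of-a-matrix-and-a-vector-in-tensorflow) worth digging into:

\* vs tf.multiply() vs \_\_mul\_\_(): on tensors all three perform the same element-wise multiplication, because the \* operator calls \_\_mul\_\_(), which dispatches to tf.multiply().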
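
A small NumPy sketch of the three spellings (NumPy is an assumption used so the example runs without TensorFlow; its arrays overload `*` in the same element-wise, broadcasting way). The matrix product is included only as a contrast.

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])
v = np.array([10.0, 100.0])

star = a * v              # operator form; v is broadcast across the rows
func = np.multiply(a, v)  # function form
dunder = a.__mul__(v)     # the method the operator calls
mat = a @ v               # matrix-vector product: a different operation

print(star)  # [[ 10. 200.] [ 30. 400.]]
print(mat)   # [210. 430.]
```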

tf.matmul [docs](https://www.tensorflow.org/api_docs/python/tf/linalg/matmul):

If one or both of the matrices contain a lot of zeros, a more efficient multiplication algorithm can be used by setting the corresponding a_is_sparse or b_is_sparse flag to True. These are False by default. This optimization is only available for plain matrices (rank-2 tensors) with datatypes bfloat16 or float32.

In [None]:
tf.rsqrt(3.0)

AttributeError: module 'tensorflow' has no attribute 'rsqrt'
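
In TensorFlow 2.x the op lives at `tf.math.rsqrt`. What it computes is the reciprocal square root, which attention uses to scale the scores by 1/sqrt(features per head). Below is a minimal pure-Python sketch of the same quantity; the `rsqrt` helper is a hypothetical stand-in, not the TensorFlow op.

```python
import math


def rsqrt(x):
    """Reciprocal square root: 1 / sqrt(x)."""
    return 1.0 / math.sqrt(x)


# With 64 features per head (768 / 12 in the smallest GPT-2),
# every attention score is multiplied by 1/8.
scale = rsqrt(64.0)
print(scale)  # 0.125
print(rsqrt(3.0))
```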