#Embedding Layer
dataset = create_tf_dataset(encoded_text, block_size)

The above code snippet:

1. Takes raw 1D token IDs

2. Creates sliding windows

3. Produces (input_ids, target_ids) pairs

Each element of dataset is a pair (input_ids, target_ids)
with shapes:
                input_ids  → (block_size,)
                target_ids → (block_size,)
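
To make the sliding-window idea concrete, here is a minimal plain-Python sketch of the three steps listed above. It is not the `create_tf_dataset` function itself; the helper name `sliding_pairs` and the toy token IDs are made up for illustration.

```python
def sliding_pairs(ids, block_size):
    """Yield (input_ids, target_ids) pairs from a flat list of token IDs."""
    # Each window spans block_size + 1 tokens and starts one token after the
    # previous window; a trailing window that would be too short is skipped.
    for start in range(len(ids) - block_size):
        window = ids[start:start + block_size + 1]
        # The target is the input shifted left by one position.
        yield window[:-1], window[1:]


toy_ids = [10, 11, 12, 13, 14, 15]  # hypothetical token IDs
pairs = list(sliding_pairs(toy_ids, block_size=4))

for inp, tgt in pairs:
    print(inp, "->", tgt)
```

Every target sequence is its input sequence shifted by one token, so position i of the target is the "next token" for position i of the input.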




In [None]:
with open("/content/The_Verdict.txt", "r", encoding="utf-8") as f:
  text = f.read()

print(text[:100])

ï»¿I HAD always thought Jack Gisburn rather a cheap genius-- though a good fellow enough--so it was no


In [None]:
!pip3 install tiktoken



In [None]:
import tiktoken

# Load the GPT-2 byte-pair-encoding tokenizer
tokenizer = tiktoken.get_encoding("gpt2")

In [None]:
IDs = tokenizer.encode(text)

print(IDs[:10],len(IDs))

[171, 119, 123, 40, 367, 2885, 1464, 1807, 3619, 402] 5170


In [None]:
vocab_size = tokenizer.n_vocab  # vocabulary size of GPT-2
print(vocab_size)

50257


In [None]:
import tensorflow as tf


block_size = 64

def create_tf_dataset(ids, block_size):
    ids = tf.constant(ids, dtype=tf.int32)
    dataset = tf.data.Dataset.from_tensor_slices(ids)
    dataset = dataset.window(block_size + 1, shift=1, drop_remainder=True)
    dataset = dataset.flat_map(lambda x: x.batch(block_size + 1))
    dataset = dataset.map(lambda x: (x[:-1], x[1:]))
    return dataset

dataset = create_tf_dataset(IDs, block_size)

for x, y in dataset:
    print("Input:", x.numpy(),"\n", "Target:", y.numpy())
    break



Input: [  171   119   123    40   367  2885  1464  1807  3619   402   271 10899
  2138   257  7026 15632   438   996   257   922  5891  1576   438   568
   340   373   645  1049  5975   284   502   284  3285   326    11   287
   262  6001   286   465 13476    11   339   550  5710   465 12036    11
  6405   257  5527 27075    11   290  4920  2241   287   257  4489    64
   319   262 34686 41976] 
 Target: [  119   123    40   367  2885  1464  1807  3619   402   271 10899  2138
   257  7026 15632   438   996   257   922  5891  1576   438   568   340
   373   645  1049  5975   284   502   284  3285   326    11   287   262
  6001   286   465 13476    11   339   550  5710   465 12036    11  6405
   257  5527 27075    11   290  4920  2241   287   257  4489    64   319
   262 34686 41976    13]


#Still cannot feed 'dataset' directly into the embedding layer.

Because:

1. The embedding layer expects only input_ids

2. The dataset yields pairs

#✅ Correct way to use this dataset
1.  Add batching (VERY IMPORTANT)

    After batching, each element of the dataset is a batch of input-target pairs, not a single pair.

    Each batch contains B (the batch size) input-target pairs.

2.  Feed ONLY input_ids to the embedding layer


**Before batching**

      Each dataset element was:
             (input_ids, target_ids)
              shapes: (block_size,), (block_size,)

**After batching**

Each dataset element becomes:

(

        [[...], [...], [...], ...],  # B windows → input_ids

        [[...], [...], [...], ...]   # B windows → target_ids

)

That is:

B (the batch size) training examples are processed together.
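
The change in shape can be checked on a toy example. The sketch below uses NumPy rather than `tf.data`, with made-up sizes (B = 3, T = 4) and made-up token IDs, purely to show how B windows stack into two (B, T) arrays.

```python
import numpy as np

B, T = 3, 4          # toy batch size and block size (assumed values)
ids = np.arange(20)  # made-up token IDs 0..19

# Take the first B sliding windows of length T + 1 and stack them row by row.
windows = np.stack([ids[i:i + T + 1] for i in range(B)])  # shape (B, T + 1)

# Split every window into an input row and a target row shifted by one token.
input_ids, target_ids = windows[:, :-1], windows[:, 1:]   # each shape (B, T)

print(input_ids.shape, target_ids.shape)
```

Each row of `input_ids` is one training example, and the matching row of `target_ids` holds its next-token targets.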


In [None]:
batch_size = 16

batched_dataset = dataset.batch(batch_size, drop_remainder=True)  # drop the last incomplete batch

#What is dataset after batching?
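The batched dataset is still a tf.data.Dataset.

More specifically, it is a tf.data.Dataset whose elements are (input_ids, target_ids) pairs.

After batching, each element is a pair of tensors, where B is the batch size and T is the block size: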

    (

      Tensor(shape=(B, T), dtype=int32),

      Tensor(shape=(B, T), dtype=int32)

    )

The dataset itself is NOT:

1. a Python list

2. a NumPy array

3. indexable (it cannot be accessed as dataset[0])

4. of known length (in most cases)

We can't find its length because:

1. tf.data.Dataset is a lazy streaming pipeline

2. Elements are created on-the-fly

3. Length is often unknown or infinite
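
A Python generator behaves in a similar way and makes a handy analogy. The sketch below is not `tf.data`; the function `token_windows` is hypothetical and produces an endless stream of toy pairs, which is why asking for its length fails.

```python
import itertools


def token_windows(block_size):
    """Endless stream of toy (input, target) pairs, built one at a time."""
    for start in itertools.count():
        window = list(range(start, start + block_size + 1))
        yield window[:-1], window[1:]


stream = token_windows(block_size=4)

# A lazy stream has no length: nothing exists until it is requested.
try:
    len(stream)
    has_len = True
except TypeError:
    has_len = False
print("has a length:", has_len)

# Elements are only computed when the consumer asks for them.
first_two = list(itertools.islice(stream, 2))
print(first_two)
```

Only the two requested pairs are ever built; the rest of the stream is never materialized.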


Why TensorFlow designed it this way

Because:

1. Datasets may be too large to fit in memory

2. Streaming allows infinite datasets

3. Works efficiently with GPUs/TPUs

4. Supports prefetching, shuffling, parallelism

Real LLM training pipelines stream their data in the same batch-by-batch fashion.




In [None]:
for input_ids, target_ids in batched_dataset.take(1): #take the first batch
    print(input_ids.shape, target_ids.shape)


# (B, T) = (16, 64): the first batch holds 16 training examples (windows), each a sequence of 64 token IDs.
# input_ids and target_ids both have this shape: 16 input windows and their 16 matching target windows.

(16, 64) (16, 64)


In [None]:
d_model = 256

embedding = tf.keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=d_model
)

for input_ids, target_ids in batched_dataset:
    x = embedding(input_ids)


The above loop processes ONE batch at a time.
#What happens conceptually:
Batch 1 → embedding → x

Batch 2 → embedding → x

Batch 3 → embedding → x
...

Only one batch lives in memory at a time (unless prefetching is enabled). Each loop iteration requests the next batch from the pipeline.

#🧠 Final summary (lock this in)

✔ Dataset loop processes one batch at a time

✔ No full dataset is loaded

✔ x = embeddings of current batch only

✔ Pipeline = lazy, streaming, efficient

✔ Same logic used in real GPT training

#What is the type of x?
<class 'tensorflow.python.framework.ops.EagerTensor'>

#What values does x contain?

x contains floating-point vectors

Initially random (because embeddings are randomly initialized)

During training → updated via backprop
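
Conceptually, an embedding layer is a lookup into a trainable table with one row per token ID. The NumPy sketch below illustrates that idea with toy sizes; it is not the Keras implementation, and the uniform range of ±0.05 is an assumption chosen to resemble Keras's default `"uniform"` initializer.

```python
import numpy as np

vocab_size, d_model = 10, 4  # toy sizes (assumed values)
rng = np.random.default_rng(0)

# One row of d_model floats per token ID, randomly initialized.
# Assumption: small uniform values, similar to Keras's default initializer.
table = rng.uniform(-0.05, 0.05, size=(vocab_size, d_model)).astype(np.float32)

input_ids = np.array([[1, 2, 3],
                      [3, 2, 1]])  # shape (B, T) = (2, 3)

# Indexing the table with an integer array looks up one row per token ID.
x = table[input_ids]               # shape (B, T, d_model)

print(x.shape, x.dtype)
```

The same token ID always maps to the same vector, so `x[0, 2]` and `x[1, 0]` (both token 3) are identical. Training changes the rows of the table, not the lookup itself.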


In [None]:
x.shape,x[0,0]



(TensorShape([16, 64, 256]),
 <tf.Tensor: shape=(256,), dtype=float32, numpy=
 array([-0.04556935, -0.02706144, -0.0355917 ,  0.03820449, -0.02867883,
        -0.04321986,  0.03961125,  0.04394214,  0.00092129,  0.00319831,
        -0.01940831, -0.02392514, -0.01141023, -0.02854631, -0.01890888,
        -0.02656835,  0.03171286, -0.03796707, -0.04148136,  0.03493017,
         0.03353545, -0.01023757,  0.0005229 ,  0.04183633,  0.01711143,
        -0.0150934 ,  0.03337849,  0.02655654,  0.02769126,  0.0389016 ,
        -0.0146213 , -0.02091422,  0.01455227, -0.03622209, -0.02376009,
        -0.0185237 ,  0.0148176 , -0.03226655, -0.01990002, -0.04868096,
         0.04151252,  0.01559379,  0.02412397, -0.03423765, -0.03266606,
        -0.0474977 ,  0.0432101 ,  0.0099595 ,  0.0137164 , -0.01885684,
         0.04175111, -0.03838579, -0.03630503,  0.0493512 , -0.03476142,
        -0.02578317, -0.01207443, -0.0257071 ,  0.04338994, -0.01243303,
         0.00681757,  0.04745338, -0.02515708,