# Prepare data format (TFRecord recommended)

### What you’re doing:
You’re taking raw examples (text, labels, maybe metadata) and encoding them into a binary format that TensorFlow can read fast and reliably. The usual choice is TFRecord. TFRecord stores serialized tf.train.Example protobufs. This reduces I/O overhead, makes shuffling and parallel reading efficient, and keeps the training code cleanly separated from raw file parsing.



### Why TFRecord?
- I/O efficiency: Sequential binary reads are faster and easier to optimize than many small text reads.
- Compatibility: tf.data.TFRecordDataset plugs directly into TensorFlow pipelines.
- Schema stability: You define feature names and types that the parser expects.
- Preprocessing options: You can precompute token ids and store them (fast) or store raw text and tokenize on-the-fly (flexible).
  
### TFRecord structure basics
A single TFRecord file is a sequence of binary records; here each record is a serialized tf.train.Example protobuf. Each Example contains a features map of named fields. Each field is a Feature, which holds exactly one of:
- bytes_list (for raw strings or serialized objects),
- int64_list (for integers, labels, token ids),
- float_list (for floats).
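
As a minimal sketch of the three feature types, the snippet below builds one `tf.train.Example` in memory and serializes it. The feature names (`text`, `label`, `score`) and their values are illustrative assumptions, not a required schema.

```python
import tensorflow as tf

# Illustrative names and values; only the three list types come from the format itself.
example = tf.train.Example(features=tf.train.Features(feature={
    # bytes_list: strings must be encoded to bytes first
    "text": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"great phone"])),
    # int64_list: integers such as labels or token ids
    "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[1])),
    # float_list: floating-point values such as scores or weights
    "score": tf.train.Feature(float_list=tf.train.FloatList(value=[0.5])),
}))

serialized = example.SerializeToString()  # bytes, ready to be written as one record
print(type(serialized).__name__, len(serialized))
```

Each `Feature` wraps a list, so a single value is stored as a list of length one.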
  
### Typical features for text classification
- text (bytes) — the raw text string or pre-tokenized string.
- label (int64) — class label.
- Optional: input_ids (int64_list) — token ids if pre-tokenized.
- Optional: attention_mask (int64_list) — 1/0 mask for padding.
- Optional: metadata (bytes) — JSON or other small metadata.
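
To make the optional fields concrete, here is a sketch in plain Python that prepares the value for each feature before any protobuf encoding. The helper `build_record`, the `PAD_ID` constant, and the sample metadata are hypothetical names chosen for illustration.

```python
import json

PAD_ID = 0  # assumption: id 0 is reserved for padding

def build_record(text, label, input_ids, max_len, metadata=None):
    """Return the plain Python value for each feature, before protobuf encoding."""
    ids = list(input_ids)[:max_len]
    n_real = len(ids)
    n_pad = max_len - n_real
    return {
        "text": text.encode("utf-8"),                            # bytes
        "label": int(label),                                     # int64
        "input_ids": ids + [PAD_ID] * n_pad,                     # int64_list
        "attention_mask": [1] * n_real + [0] * n_pad,            # int64_list, 1 = real token
        "metadata": json.dumps(metadata or {}).encode("utf-8"),  # bytes holding JSON
    }

record = build_record("great phone", 1, [7, 4], max_len=4, metadata={"source": "demo"})
print(record)
```

The mask has a 1 for every real token and a 0 for every padding position, so a model can ignore the padded tail.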
  
### Why not just use Hugging Face transformers for embedding our raw text?
Using a Hugging Face transformer to embed text is a great option when you want to leverage pre-trained language models for feature extraction. However, not every machine learning task requires large language models (LLMs). Here's why we are making our own embeddings and data formats instead of relying solely on online libraries:

- **Task-Specific Needs**: Not all ML tasks require the power and complexity of LLMs. For example, a simple sentiment analysis task on a small dataset might perform well with a lightweight tokenizer and embeddings tailored to the dataset, rather than using a large pre-trained model.
- **Flexibility**: TFRecord allows you to preprocess once and reuse the data efficiently across multiple training runs. Hugging Face embeddings can be computed on-the-fly, but this adds computational overhead during training.
- **Scalability**: For large datasets, precomputing embeddings and storing them in TFRecord can save time and resources.
- **Customizability**: TFRecord lets you store additional features (e.g., metadata, labels) alongside embeddings, which can be useful for complex pipelines.
- **Integration**: TFRecord integrates seamlessly with TensorFlow's `tf.data` API, enabling efficient data loading and preprocessing.

### Example of a non-LLM task
Consider a recommendation system for a small e-commerce platform. Instead of using LLMs, you might:
- Tokenize product descriptions with a simple regex tokenizer.
- Use a small vocabulary to create embeddings.
- Combine these embeddings with user interaction data (e.g., clicks, purchases) stored in TFRecord format.

This approach is lightweight, interpretable, and sufficient for the task, without the overhead of LLMs.
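
A rough sketch of that lightweight approach, using only the standard library, might look like the following. The product descriptions, the `<pad>`/`<unk>` entries, the embedding size, and the random embedding table are all assumptions made for illustration; a real system would learn the embedding values during training.

```python
import random
import re

def tokenize(text):
    # simple regex tokenizer: lowercase alphabetic words only
    return re.findall(r"[a-z]+", text.lower())

descriptions = ["Wireless mouse with USB receiver", "USB keyboard, wireless"]  # made-up data
vocab = {"<pad>": 0, "<unk>": 1}
for d in descriptions:
    for tok in tokenize(d):
        vocab.setdefault(tok, len(vocab))  # assign the next free id to each new token

EMBED_DIM = 4
rng = random.Random(0)  # fixed seed so the table is reproducible
# one small random vector per vocabulary entry (stand-in for learned embeddings)
embedding_table = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in vocab]

def embed(text):
    """Mean-pool the token embeddings of a description into one vector."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(text)]
    if not ids:
        return [0.0] * EMBED_DIM
    return [sum(embedding_table[i][d] for i in ids) / len(ids) for d in range(EMBED_DIM)]

def product_features(text, clicks, purchases):
    # concatenate the text embedding with the interaction counts
    return embed(text) + [float(clicks), float(purchases)]

features = product_features("wireless USB mouse", clicks=12, purchases=3)
print(len(features), features)
```

The resulting vector is a flat list of floats, so it could be stored in a TFRecord as a `float_list` feature.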


In [None]:
%pip install tensorflow
%pip install numpy

In [None]:
# How to write TFRecords
import tensorflow as tf

def _bytes_feature(value: bytes):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value: int):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def _int64_list_feature(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))

# Write parallel lists of texts and labels to a TFRecord file.
# Each (text, label) pair becomes one tf.train.Example, serialized as one record.
# input_ids_list is optional; when given, each example also gets an 'input_ids' field.
def write_examples(output_path, texts, labels, input_ids_list=None):
    with tf.io.TFRecordWriter(output_path) as w:
        for i, text in enumerate(texts):
            feature = {
                'text': _bytes_feature(text.encode('utf-8')),
                'label': _int64_feature(int(labels[i])),
            }
            if input_ids_list is not None:
                feature['input_ids'] = _int64_list_feature(input_ids_list[i])
            example = tf.train.Example(features=tf.train.Features(feature=feature))
            w.write(example.SerializeToString())


### How to parse TFRecords in tf.data
When sequences have variable length (token ids), write them as an int64_list, parse them with `tf.io.VarLenFeature` (which yields a sparse tensor), and convert the result to dense with padding.
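
As an alternative to padding every example to one global length, `tf.data` can pad per batch. The sketch below uses `padded_batch` on made-up token id sequences; it assumes a TensorFlow 2.x version that supports `output_signature` in `from_generator`.

```python
import tensorflow as tf

sequences = [[1, 2, 3], [4], [5, 6]]  # made-up token ids of different lengths

ds = tf.data.Dataset.from_generator(
    lambda: iter(sequences),
    output_signature=tf.TensorSpec(shape=[None], dtype=tf.int32),
)
# padded_batch pads each batch to its longest sequence (default padding value is 0)
batch = next(iter(ds.padded_batch(3)))
print(batch.numpy())
```

Per-batch padding wastes less space on short sequences, while a fixed `max_len` gives every batch the same shape; which one fits depends on the model.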


In [None]:
# Parsing turns each serialized record back into tensors, using a feature schema
import tensorflow as tf

feature_description = {
    'text': tf.io.FixedLenFeature([], tf.string),
    'label': tf.io.FixedLenFeature([], tf.int64),
    'input_ids': tf.io.VarLenFeature(tf.int64),  # variable-length list
}

def _parse_function(serialized_example, max_len=128):
    example = tf.io.parse_single_example(serialized_example, feature_description)
    label = tf.cast(example['label'], tf.int32)
    # VarLenFeature always yields a SparseTensor (empty when the feature is absent),
    # so no None check or fallback branch is needed.
    input_ids = tf.sparse.to_dense(example['input_ids'], default_value=0)  # shape=(None,)
    input_ids = input_ids[:max_len]  # truncate if longer than max_len
    pad_len = max_len - tf.shape(input_ids)[0]
    # right-pad with zeros; this is a no-op when pad_len == 0
    input_ids = tf.pad(input_ids, [[0, pad_len]])
    input_ids = tf.cast(input_ids, tf.int32)
    # record the static shape so downstream layers see (max_len,)
    input_ids = tf.ensure_shape(input_ids, [max_len])
    return input_ids, label


### Quick test: write and read TFRecord examples
This cell demonstrates writing a TFRecord using the `write_examples` function defined above, then reading and parsing the file to inspect the stored features and how a parse function like `_parse_function` pads/truncates `input_ids`.

In [None]:
# Real-world style TFRecord write + parse demo (sentiment mini-dataset)
import tensorflow as tf
import os, re

########################################
# 1. Simulated raw dataset (product review sentiment)
########################################
texts = [
    "I love this phone, battery life is great",
    "Terrible customer service, not recommended",
    "Camera quality is amazing and fast",
    "The screen cracked easily and support was slow",
]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative sentiment

# Simple whitespace/punctuation tokenizer + vocab build (REAL systems use better tokenizers)
def tokenize(text):
    # keep words/apostrophes; lowercase for normalization
    return re.findall(r"[a-zA-Z']+", text.lower())

# Build vocabulary (reserve 0 for padding)
all_tokens = []
for t in texts:
    all_tokens.extend(tokenize(t))
vocab_tokens = sorted(set(all_tokens))
vocab = {tok: i+1 for i, tok in enumerate(vocab_tokens)}  # ids start at 1

# Convert texts to list of token ids
input_ids_list = []
for t in texts:
    toks = tokenize(t)
    ids = [vocab[x] for x in toks]
    input_ids_list.append(ids)

# Show vocab + tokenization mapping
print("Vocabulary size:", len(vocab))
print("Vocabulary mapping (token -> id):", vocab)
for i, t in enumerate(texts):
    print(f"Text {i} tokens: {tokenize(t)} -> ids: {input_ids_list[i]} (len={len(input_ids_list[i])})")

# 2. Write TFRecord file with raw text, label, and token id list
output_path = '/tmp/sentiment_demo.tfrecord'
if os.path.exists(output_path):
    os.remove(output_path)
write_examples(output_path, texts, labels, input_ids_list)
print(f"\nWrote TFRecord file: {output_path} bytes={os.path.getsize(output_path)}")

# 3. Inspect raw serialized examples
raw_ds = tf.data.TFRecordDataset([output_path])
print("\nSerialized example byte lengths:")
for i, raw in enumerate(raw_ds.take(len(texts))):
    print(f" Example {i}: {len(raw.numpy())} bytes (scalar tensor shape {raw.shape})")

# 4. Define feature schema and parse function (variable-length -> fixed length)
feature_description = {
    'text': tf.io.FixedLenFeature([], tf.string),
    'label': tf.io.FixedLenFeature([], tf.int64),
    'input_ids': tf.io.VarLenFeature(tf.int64),
}
def parse_function(serialized, max_len=12):
    ex = tf.io.parse_single_example(serialized, feature_description)
    # raw bytes -> text, label cast
    text = ex['text']
    label = tf.cast(ex['label'], tf.int32)
    # sparse -> dense list of ids
    sparse_ids = ex['input_ids']
    dense_ids = tf.sparse.to_dense(sparse_ids, default_value=0)  # original length
    # truncate
    dense_ids = dense_ids[:max_len]
    pad_len = max_len - tf.shape(dense_ids)[0]
    # right pad with zeros if shorter
    dense_ids = tf.cond(pad_len > 0, lambda: tf.pad(dense_ids, [[0, pad_len]]), lambda: dense_ids)
    dense_ids = tf.cast(dense_ids, tf.int32)
    return {'text': text, 'input_ids': dense_ids, 'label': label}

# tf.data datasets can be iterated more than once; raw_ds is recreated here only for clarity
raw_ds = tf.data.TFRecordDataset([output_path])
parsed_ds = raw_ds.map(lambda x: parse_function(x, max_len=12))

# 5. Show each parsed example (after padding/truncation)
print("\nParsed examples (fixed length input_ids, max_len=12):")
for i, ex in enumerate(parsed_ds.take(len(texts))):
    print(f" Example {i}:")
    print("   text:", ex['text'].numpy().decode('utf-8'))
    print("   label:", int(ex['label'].numpy()))
    print("   input_ids (len=12):", ex['input_ids'].numpy())

# 6. Batching (how model will typically consume data)
batched = parsed_ds.batch(2)
print("\nBatches (size=2):")
for bi, batch in enumerate(batched):
    print(f" Batch {bi}:")
    print("   labels:", batch['label'].numpy())
    print("   input_ids shape:", batch['input_ids'].shape)
    print("   first review text:", batch['text'][0].numpy().decode('utf-8'))


# Beginner Explanation: What just happened in the TFRecord demo

We'll walk through **every printed section** from the code cell above and connect it to why data formatting (TFRecord + parsing) matters for machine learning.

---
## A. Why format data at all?
Raw data (text, numbers, images) lives in many messy forms. Before training a model we need:
1. **Consistency** – every training example should have the same *shape* (e.g. a fixed-length vector of token ids).
2. **Speed** – binary formats (TFRecord) load faster than reading many tiny text files.
3. **Portability** – you can move TFRecords between machines without changing parsing code.
4. **Separation of concerns** – expensive tokenization/preprocessing can be done once (during writing) instead of every epoch.

So we "format" the raw dataset into a structured, efficient representation that the `tf.data` pipeline can stream.

---
## B. Building a mini real-world dataset
We pretended we have a **sentiment classification** dataset: short product review sentences labeled 1 (positive) or 0 (negative).

Texts:
- Positive: "I love this phone, battery life is great" (label 1)
- Negative: "Terrible customer service, not recommended" (label 0)
- Positive: "Camera quality is amazing and fast" (label 1)
- Negative: "The screen cracked easily and support was slow" (label 0)

Goal: Turn each sentence into a numeric vector (token ids) + its label.

---
## C. Tokenization and vocabulary
Printed lines:
```
Vocabulary size: 25
Vocabulary mapping (token -> id): {...}
Text 0 tokens: ['i', 'love', 'this', 'phone', 'battery', 'life', 'is', 'great'] -> ids: [10, 13, 24, 15, 3, 12, 11, 9] (len=8)
...
```
What this means:
- We split each sentence into lowercase word tokens using a simple regex (`tokenize`).
- We built a **vocabulary**: each unique token gets an integer id (starting at 1). Id **0** is reserved for padding.
- For every text we converted tokens to their ids – these integer lists are variable length (5, 6, 8, ...). Models prefer fixed length, so we'll pad later.

Why do this upfront?
- Numeric ids are what embedding layers / neural networks expect.
- Storing them now inside TFRecord avoids recomputing tokenization every training epoch.
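
Stored ids are only meaningful together with the vocabulary that produced them, so it helps to save the vocabulary next to the TFRecord. The sketch below is one possible way to do that with JSON. The reserved `UNK_ID`, the id offset of 2, and the file name `vocab.json` are assumptions that differ from the demo cell, which starts ids at 1 and has no unknown-token id.

```python
import json
import os
import re
import tempfile

PAD_ID, UNK_ID = 0, 1  # assumption: reserve 0 for padding and 1 for unknown tokens

def tokenize(text):
    return re.findall(r"[a-zA-Z']+", text.lower())

def build_vocab(texts):
    tokens = sorted({tok for t in texts for tok in tokenize(t)})
    return {tok: i + 2 for i, tok in enumerate(tokens)}  # ids start after the reserved ones

def encode(text, vocab):
    # tokens missing from the vocabulary map to UNK_ID instead of raising KeyError
    return [vocab.get(tok, UNK_ID) for tok in tokenize(text)]

vocab = build_vocab(["Camera quality is amazing and fast"])
vocab_path = os.path.join(tempfile.mkdtemp(), "vocab.json")  # hypothetical location
with open(vocab_path, "w", encoding="utf-8") as f:
    json.dump(vocab, f)

with open(vocab_path, encoding="utf-8") as f:
    loaded = json.load(f)

print(encode("camera is slow", loaded))  # "slow" is not in this small vocabulary
```

Loading the same file at training and inference time keeps token ids consistent with the ids stored in the records.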

---
## D. Writing the TFRecord file
Printed line:
```
Wrote TFRecord file: /tmp/sentiment_demo.tfrecord bytes=457
```
Meaning:
- We called `write_examples(...)`: for each example we created a `tf.train.Example` protobuf with fields:
  - `text` (raw bytes of the original sentence)
  - `label` (0 or 1)
  - `input_ids` (list of token ids)
- All examples were serialized and appended into one TFRecord file.

Why store both `text` and `input_ids`?
- Flexibility: You can later re-tokenize differently if you wish (you still have raw text).
- Speed: You have precomputed ids ready for training now.

---
## E. Raw serialized examples
Printed lines:
```
Serialized example byte lengths:
 Example 0: 99 bytes (scalar tensor shape ())
 ...
```
Meaning:
- Each line shows the size in bytes of one serialized `tf.train.Example` record.
- Shape `()` means a scalar Tensor whose value is the bytes blob.
- Different lengths happen because sentences and token id lists vary.
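
To look inside one of these byte blobs without a feature description, the bytes can be decoded back into a protobuf. The sketch below builds an Example in memory with illustrative values and decodes it again with `tf.train.Example.FromString`.

```python
import tensorflow as tf

# Build and serialize one Example in memory (values are illustrative)
example = tf.train.Example(features=tf.train.Features(feature={
    "text": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"great phone"])),
    "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[1])),
    "input_ids": tf.train.Feature(int64_list=tf.train.Int64List(value=[9, 15])),
}))
blob = example.SerializeToString()

# Decode the bytes back into a protobuf to inspect the stored fields
decoded = tf.train.Example.FromString(blob)
for name, feature in sorted(decoded.features.feature.items()):
    kind = feature.WhichOneof("kind")  # 'bytes_list', 'int64_list' or 'float_list'
    print(name, kind, list(getattr(feature, kind).value))
```

With a dataset such as `raw_ds` from the demo, the same call would be applied to `raw.numpy()` for each record.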

---
## F. Parsing (turn bytes back into tensors)
Printed block for each example:
```
Parsed examples (fixed length input_ids, max_len=12):
 Example 0:
   text: I love this phone, battery life is great
   label: 1
   input_ids (len=12): [10 13 24 15  3 12 11  9  0  0  0  0]
...
```
Meaning:
- We read each serialized record and used `tf.io.parse_single_example` with a **feature description**:
  - `FixedLenFeature([], tf.string)` for `text` and `FixedLenFeature([], tf.int64)` for `label` (single values).
  - `VarLenFeature(tf.int64)` for `input_ids` (because length varies per example).
- `VarLenFeature` returns a sparse representation; we turned it into a dense vector.
- We then enforced a **fixed length** (`max_len=12`):
  - If original length < 12 → pad zeros on the right.
  - If original length > 12 → truncate (not shown here, but that's what the slice does).
- Result: every example now has a uniform `input_ids` shape `(12,)` suitable for batching and passing to a model.

Why pad/truncate?
- Neural nets usually operate on fixed-size tensors for simplicity and speed (especially when using batches).
- Padding with zeros lets us keep original relative positions of real tokens at the front.
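
The sparse-to-dense and pad/truncate steps can be sketched in plain Python, including the truncation case that the demo output does not show. `sparse_to_dense_1d` and `fix_length` are hypothetical helpers that only imitate the TensorFlow behaviour for a 1-D list; they are not TensorFlow APIs.

```python
def sparse_to_dense_1d(indices, values, length, default=0):
    """Pure-Python stand-in for what tf.sparse.to_dense does to a 1-D sparse tensor."""
    dense = [default] * length
    for (i,), v in zip(indices, values):
        dense[i] = v
    return dense

def fix_length(ids, max_len, pad_id=0):
    ids = ids[:max_len]                           # truncate if too long
    return ids + [pad_id] * (max_len - len(ids))  # right-pad if too short

# A sparse 1-D tensor is described by positions, values, and a dense length
dense = sparse_to_dense_1d([[0], [1], [2], [3], [4]], [22, 6, 19, 14, 17], length=5)
print(fix_length(dense, 12))  # shorter than max_len: zeros are appended
print(fix_length(dense, 3))   # longer than max_len: the tail is dropped
```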

---
## G. Batching
Printed lines:
```
Batches (size=2):
 Batch 0:
   labels: [1 0]
   input_ids shape: (2, 12)
   first review text: I love this phone, battery life is great
...
```
Meaning:
- We grouped examples into batches of 2.
- `input_ids shape: (2, 12)` means: 2 examples per batch, each with length 12 vector.
- Batching lets the model process multiple examples in parallel on the GPU/accelerator.
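
A small plain-Python sketch shows why batching depends on the fixed length from the previous step. `batch_examples` is a hypothetical helper and the rows are made up; it refuses to group rows of different lengths, because such rows cannot form one rectangular tensor.

```python
def batch_examples(examples, batch_size):
    """Yield lists of equal-length rows; the last batch may be smaller."""
    for start in range(0, len(examples), batch_size):
        rows = examples[start:start + batch_size]
        lengths = {len(r) for r in rows}
        if len(lengths) != 1:
            raise ValueError(f"rows in a batch must share one length, got {sorted(lengths)}")
        yield rows

padded = [[10, 13, 0, 0], [22, 6, 19, 0], [4, 16, 11, 1]]  # three made-up rows of length 4
shapes = [(len(b), len(b[0])) for b in batch_examples(padded, 2)]
print(shapes)  # (rows in batch, row length) for each batch
```

With three rows and a batch size of 2, the final batch holds only one row, which matches how `batch` behaves when the remainder is not dropped.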

---
## H. What is actually happening under the hood?
Step-by-step flow for one example:
1. Raw Python string ("Terrible customer service, not recommended").
2. Tokenize into words → `['terrible','customer','service','not','recommended']`.
3. Map words to ids (ids come from a fixed vocabulary: each token gets a unique integer, and that id is later used to look up the corresponding embedding vector) → `[22,6,19,14,17]`.
4. Create a `tf.train.Example` protobuf with features `{text: bytes, label: int64, input_ids: int64_list}`.
5. Serialize and write to TFRecord file.
6. Later: Read raw bytes from file via `TFRecordDataset`.
7. Parse bytes back into structured tensors (string, int64, sparse list).
8. Convert sparse list to dense, enforce fixed length (pad zeros) → `[22,6,19,14,17,0,0,0,0,0,0,0]`.
9. Batch with other examples for training.
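
Steps 1 to 3 and step 8 can be traced in plain Python for the second review, using the same four sentences and the same regex as the demo cell. The TFRecord write and read in between are left out of this sketch.

```python
import re

texts = [
    "I love this phone, battery life is great",
    "Terrible customer service, not recommended",
    "Camera quality is amazing and fast",
    "The screen cracked easily and support was slow",
]

def tokenize(text):
    return re.findall(r"[a-zA-Z']+", text.lower())

# Vocabulary over all four sentences; id 0 stays reserved for padding
all_tokens = sorted({tok for s in texts for tok in tokenize(s)})
vocab = {tok: i + 1 for i, tok in enumerate(all_tokens)}

tokens = tokenize(texts[1])            # step 2: split into lowercase words
ids = [vocab[t] for t in tokens]       # step 3: map words to ids
padded = ids + [0] * (12 - len(ids))   # step 8: right-pad (this sentence has fewer than 12 tokens)

print(tokens)
print(ids)
print(padded)
```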

---
## I. Why TFRecord + tf.data instead of plain Python lists?
- **Streaming**: `tf.data` can prefetch, shuffle, interleave files efficiently.
- **Scalability**: Works the same for 4 examples or 40 million.
- **Performance**: Binary sequential reads are fast and reduce Python overhead.
- **Clean training loop**: Model code sees ready-to-use tensors; no custom per-epoch tokenization logic.
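
As a hedged sketch of the streaming point, the pipeline below reads two tiny shard files in an interleaved fashion, then shuffles, batches, and prefetches. The shard names, record contents, buffer size, and batch size are made up for illustration, and the records are plain byte strings rather than serialized Examples to keep the sketch short.

```python
import os
import tempfile

import tensorflow as tf

# Write two tiny shard files with raw byte records (hypothetical file names)
tmp_dir = tempfile.mkdtemp()
for shard in range(2):
    path = os.path.join(tmp_dir, f"demo-{shard}.tfrecord")
    with tf.io.TFRecordWriter(path) as writer:
        for i in range(3):
            writer.write(f"shard{shard}-record{i}".encode("utf-8"))

files = tf.data.Dataset.list_files(os.path.join(tmp_dir, "demo-*.tfrecord"), shuffle=True)
ds = (
    files.interleave(                      # read from both shards, alternating between them
        tf.data.TFRecordDataset,
        cycle_length=2,
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    .shuffle(buffer_size=6)                # shuffle individual records
    .batch(4)
    .prefetch(tf.data.AUTOTUNE)            # overlap data loading with model computation
)

records = [r.decode("utf-8") for batch in ds for r in batch.numpy()]
print(len(records), sorted(records)[:2])
```

In a real pipeline a `map` call with a parse function would usually sit between `shuffle` and `batch`.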

---
## J. The extra TensorFlow log line
```
Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
```
This just means the dataset iteration reached the end. It's informational, not an error.

---
## K. Summary in plain words
We converted human-readable sentences into a machine-friendly, uniform numeric format stored efficiently on disk. Then we loaded that formatted data, padded it to equal lengths, and batched it—exactly what a training loop needs. "Data formatting" here is the process of transforming raw, messy input into clean, consistent tensors.
