# Data Pipelines in TensorFlow

### What is a “data pipeline” in machine learning?
A data pipeline is the system that feeds data into your model during training and evaluation.
Think of it like a kitchen conveyor belt:
- At one end, you load in raw ingredients (text files, CSVs, TFRecords, images, etc.).
- Along the belt, you wash, cut, and prepare them (decode, normalize, tokenize, batch).
- At the other end, your model gets perfectly prepared “mini-meals” (tensors ready for training).

If this belt is slow, the model sits idle and GPU time is wasted.
If it buffers far more data than the model can consume, it wastes memory.
ML engineers therefore tune it for balanced throughput.

### Why TensorFlow needs pipelines
TensorFlow runs training as computation graphs (Keras `fit` and `tf.function` build them for you) on CPUs, GPUs, or TPUs.
Those devices are fast, so the bottleneck is often the data loading step:
reading from disk, decoding files, augmenting images, or tokenizing text.

To fix this, TensorFlow provides the tf.data API — a high-performance, graph-integrated data pipeline system.
It lets you:
- Stream data efficiently from disk or memory
- Parallelize operations across CPU cores
- Prefetch batches so the GPU never waits
- Cache preprocessed data to avoid recomputation
- Compose your data transformations like a chain
  
So when we say “TensorFlow pipeline,” we really mean:
A tf.data.Dataset object that describes how to load, process, and batch your data efficiently, often entirely inside the TensorFlow graph.
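
As a minimal sketch of that idea, the snippet below builds a `tf.data.Dataset` from in-memory tensors and chains a few transformations. The `features` and `labels` values are made up for illustration; a real pipeline would usually read from files.

```python
import tensorflow as tf

# Hypothetical in-memory data: 10 rows of 2 features, with 10 integer labels.
features = tf.reshape(tf.range(20, dtype=tf.float32), (10, 2))
labels = tf.range(10) % 2

dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))  # one element per row
    .map(lambda x, y: (x / 20.0, y), num_parallel_calls=tf.data.AUTOTUNE)  # scale features
    .batch(4)                    # groups of 4 rows; the last batch holds the remaining 2
    .prefetch(tf.data.AUTOTUNE)  # prepare the next batch in the background
)

batch_shapes = [tuple(x.shape) for x, _ in dataset]
print(batch_shapes)  # [(4, 2), (4, 2), (2, 2)]
```

Nothing is read or computed until the dataset is iterated; the chained calls only describe the work.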

### What actually happens inside a pipeline
Let’s walk through a typical training example:

```py
import tensorflow as tf

# parse_line is a user-supplied function that turns one raw line into model inputs.
dataset = (
    tf.data.TextLineDataset("reviews.txt")     # 1️⃣ Read from file(s)
    .map(parse_line)                           # 2️⃣ Parse or tokenize text
    .shuffle(buffer_size=10000)                # 3️⃣ Shuffle samples for randomness
    .batch(32)                                 # 4️⃣ Group into batches
    .prefetch(tf.data.AUTOTUNE)                # 5️⃣ Prepare next batch while GPU trains
)
```
Let’s break it down:

| Step   | Function                          | What It Does                                                   | Why It Matters                                |
|--------|-----------------------------------|----------------------------------------------------------------|----------------------------------------------|
| 1. Read | TextLineDataset, TFRecordDataset, etc. | Streams records from files (or reads from memory) instead of loading everything up front. | Keeps memory use low and avoids file I/O bottlenecks. |
| 2. Map  | .map(func)                       | Applies a transformation to each element (like parsing JSON, tokenizing text, decoding images). | Lets you preprocess inside TF (parallelizable). |
| 3. Shuffle | .shuffle(buffer_size)          | Randomizes sample order, reshuffling each epoch by default.    | Decorrelates consecutive batches, which helps training converge and generalize. |
| 4. Batch | .batch(batch_size)              | Groups samples into mini-batches for training.                 | Allows vectorized GPU operations.            |
| 5. Prefetch | .prefetch(tf.data.AUTOTUNE)   | Loads next batch while GPU is training on current batch.       | Maximizes GPU utilization.                   |

That’s a complete data pipeline — from reading → preprocessing → batching → feeding.
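
To run the five steps end to end, `parse_line` needs a definition. The sketch below assumes a hypothetical file format of one `<label><TAB><text>` pair per line and writes a tiny temporary file to read from; the format, the file contents, and the batch size of 2 are all assumptions made for illustration.

```python
import os
import tempfile

import tensorflow as tf


def parse_line(line):
    # Hypothetical format: "<label>\t<review text>", e.g. "1\tgreat movie".
    parts = tf.strings.split(line, sep="\t", maxsplit=1)
    label = tf.strings.to_number(parts[0], out_type=tf.int32)
    text = parts[1]
    return text, label


# Write a three-line sample file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "reviews.txt")
with open(path, "w") as f:
    f.write("1\tgreat movie\n0\tboring plot\n1\tloved it\n")

dataset = (
    tf.data.TextLineDataset(path)   # 1. read lines
    .map(parse_line)                # 2. split each line into (text, label)
    .shuffle(buffer_size=10000)     # 3. shuffle
    .batch(2)                       # 4. batch
    .prefetch(tf.data.AUTOTUNE)     # 5. prefetch
)

examples = []
for texts, labels in dataset:
    for text, label in zip(texts.numpy(), labels.numpy()):
        examples.append((text.decode("utf-8"), int(label)))
print(sorted(examples))
```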

### What makes TensorFlow pipelines special
TensorFlow’s tf.data pipelines are not just loops — they are part of the graph, meaning:
- They can run asynchronously from the model.
- They can overlap CPU preprocessing and GPU training.
- They can automatically tune performance using tf.data.AUTOTUNE.
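- They can scale across devices (e.g., multiple GPUs or TPUs).

This is what makes them faster and cleaner than a hand-written Python loop that pulls batches from an ordinary Python list or generator, like:
``` py
# Here `dataset` is a plain Python iterable, not a tf.data.Dataset:
# each batch is loaded and prepared on the main thread while the GPU waits.
for x, y in dataset:
    model.train_on_batch(x, y)
```
That’s fine for small demos, but real-world ML needs throughput and reproducibility, which tf.data is designed to provide.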
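
To make the overlap idea concrete, here is a pure-Python sketch of prefetching: a background thread keeps a small queue filled while the consumer works on the current item. This is a conceptual illustration only, not how tf.data is implemented, and the `prefetch` and `slow_batches` helpers are hypothetical names.

```python
import queue
import threading
import time


def prefetch(iterable, buffer_size=1):
    """Yield items from `iterable`, producing up to `buffer_size` items ahead."""
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the stream

    def producer():
        for item in iterable:
            q.put(item)  # blocks when the buffer is full
        q.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item


def slow_batches(n):
    for i in range(n):
        time.sleep(0.01)  # simulate disk reads and preprocessing
        yield i


results = [batch * 2 for batch in prefetch(slow_batches(5), buffer_size=2)]
print(results)  # [0, 2, 4, 6, 8]
```

The bounded queue is the design point: it lets the producer run ahead of the consumer, but only by `buffer_size` items, so memory use stays capped.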

### TF pipeline in context of NLP
When your data is text, the pipeline also does:
1. Reading raw text (from files, CSVs, TFRecords).
2. Tokenizing (turning words → integers).
3. Padding/truncating sequences.
4. Building attention masks or features.
5. Batching for model input.

Example:
```py
dataset = (
    tf.data.TextLineDataset("data.txt")
    .map(tokenize_and_pad, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)
```

The tokenize_and_pad function might call:
- tf.strings.split() (basic whitespace splitting)
- a pretrained tokenizer such as keras_nlp.tokenizers.WordPieceTokenizer
- tf.py_function wrapping a Python tokenizer (flexible, but it runs in the Python interpreter, so it is slower and harder to parallelize)

That’s what “integrating a tokenizer into the pipeline” means: it becomes part of this conveyor belt.
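
One possible shape for `tokenize_and_pad`, using only TensorFlow string ops, is sketched below. It hashes tokens to integer IDs as a stand-in for a real learned vocabulary; `MAX_LEN`, `NUM_BUCKETS`, and the sample sentences are assumptions chosen for illustration.

```python
import tensorflow as tf

MAX_LEN = 8         # hypothetical fixed sequence length
NUM_BUCKETS = 1000  # hypothetical size of the hashed "vocabulary"


def tokenize_and_pad(line):
    tokens = tf.strings.split(tf.strings.lower(line))  # whitespace tokenization
    tokens = tokens[:MAX_LEN]                          # truncate long sequences
    # Hash each token to an ID in [1, NUM_BUCKETS]; 0 is reserved for padding.
    ids = tf.strings.to_hash_bucket_fast(tokens, NUM_BUCKETS) + 1
    pad_len = MAX_LEN - tf.shape(ids)[0]
    ids = tf.pad(ids, [[0, pad_len]])                  # right-pad with zeros
    ids = tf.ensure_shape(ids, [MAX_LEN])              # make the static shape known
    mask = tf.cast(ids > 0, tf.int32)                  # 1 for real tokens, 0 for padding
    return ids, mask


lines = ["A short review", "one two three four five six seven eight nine ten"]
dataset = (
    tf.data.Dataset.from_tensor_slices(lines)
    .map(tokenize_and_pad, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(2)
    .prefetch(tf.data.AUTOTUNE)
)

ids, mask = next(iter(dataset))
print(ids.shape)            # (2, 8)
print(mask.numpy().tolist())
```

Because the function uses only TensorFlow ops, it runs inside the graph and can be parallelized by `map`.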

### Why we care about things like cache(), prefetch(), AUTOTUNE
These are performance tuning knobs for your data loader:

| Function                  | Description                                           | Common Use                                      |
|---------------------------|-------------------------------------------------------|------------------------------------------------|
| cache()                  | Stores elements in memory (or in a file, if a filename is given) during the first pass, so later epochs skip the upstream steps. | When the preprocessed dataset fits in RAM or on local disk. |
| prefetch()               | Loads the next batch while the model trains on the current one. | Usually the last step, with AUTOTUNE.          |
| AUTOTUNE                 | Lets TF automatically pick parallelism/prefetch settings. | Default best practice.                         |
| map(num_parallel_calls=AUTOTUNE) | Runs preprocessing functions in parallel threads. | Speeds up CPU-bound steps like decoding/tokenizing. |

Together, these let TensorFlow stream data continuously to your GPU.
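
Where these calls sit in the chain matters. The sketch below shows one common ordering, with `cache()` after the expensive `map` and before `shuffle`, so preprocessing runs once while the order is still re-randomized each epoch. The `expensive_preprocess` function and the toy `range(10)` data are placeholders.

```python
import tensorflow as tf


def expensive_preprocess(x):
    # Placeholder for costly work such as decoding or tokenizing.
    return tf.cast(x, tf.float32) / 255.0


ds = (
    tf.data.Dataset.range(10)
    .map(expensive_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .cache()                     # in memory; pass a filename to cache on disk instead
    .shuffle(10)                 # after cache, so each epoch gets a new order
    .batch(4)
    .prefetch(tf.data.AUTOTUNE)  # last step: overlap data preparation with training
)

epoch_1 = sorted(v for batch in ds for v in batch.numpy().tolist())
epoch_2 = sorted(v for batch in ds for v in batch.numpy().tolist())
print(len(epoch_1), epoch_1 == epoch_2)  # 10 True
```

Random augmentation, if any, would normally go after `cache()`; otherwise the same augmented samples are replayed every epoch.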

### How it connects to what you’ll build later
Eventually, your model training loop will look like this:
``` py
for x_batch, y_batch in dataset:
    with tf.GradientTape() as tape:
        preds = model(x_batch, training=True)
        loss = loss_fn(y_batch, preds)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
```
The dataset in this loop is what the pipeline built.
Each pass over it yields one epoch of ready-to-train batches; add .repeat() if you want it to produce batches indefinitely.

That’s why every TensorFlow engineer must master tf.data — it’s how you feed your models at scale.
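
For reference, here is a self-contained version of that loop that can be run as-is. The synthetic regression data, the one-layer model, the loss, and the optimizer settings are all assumptions chosen to keep the example small.

```python
import tensorflow as tf

# Synthetic data: 64 samples, 4 features; the target is the sum of the features.
x = tf.random.normal((64, 4))
y = tf.reduce_sum(x, axis=1, keepdims=True)

dataset = (
    tf.data.Dataset.from_tensor_slices((x, y))
    .shuffle(64)
    .batch(16)
    .prefetch(tf.data.AUTOTUNE)
)

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
loss_fn = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

steps_per_epoch = []
for epoch in range(3):
    steps = 0
    for x_batch, y_batch in dataset:  # one full pass = one epoch
        with tf.GradientTape() as tape:
            preds = model(x_batch, training=True)
            loss = loss_fn(y_batch, preds)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        steps += 1
    steps_per_epoch.append(steps)
    print(f"epoch {epoch}: {steps} steps, last loss {float(loss):.4f}")
```

With 64 samples and a batch size of 16, each epoch runs 4 steps, and the dataset is reshuffled every time the inner loop restarts.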

### Summary: “What are TensorFlow pipelines?”

| Concept         | Intuition                          | Analogy                                      |
|------------------|------------------------------------|----------------------------------------------|
| Pipeline         | The complete data flow from disk → ready tensors | A kitchen conveyor belt for data            |
| tf.data.Dataset  | TensorFlow object representing a pipeline | Recipe for data preparation                 |
| map()            | Transform each data sample        | Chop vegetables on the belt                 |
| batch()          | Group samples together            | Pack boxes of meals                         |
| prefetch()       | Get the next batch ready          | Chef preps next dish while plating current one |
| cache()          | Save processed data               | Store pre-chopped ingredients               |
| AUTOTUNE         | Auto-optimizes performance        | Smart chef who adjusts speed                |

NOTE: in scikit-learn, pipelines chain processing + modeling steps, while in TensorFlow, pipelines mainly handle data loading, preprocessing, and feeding efficiently to the model during training.

### Example: High-Level Text Classification Pipeline
Let’s build a clean pipeline for a text classification task — say, classifying IMDB movie reviews as positive or negative.


```py
%pip install tensorflow tensorflow-datasets
```

```py
import tensorflow as tf
import tensorflow_datasets as tfds

# 1. Load the dataset.
# TensorFlow Datasets (TFDS) provides ready-to-use splits.
train_ds, test_ds = tfds.load(
    "imdb_reviews",
    split=["train", "test"],
    as_supervised=True,  # yield (text, label) pairs
)

# 2. Tokenize and vectorize with TextVectorization.
# This built-in Keras preprocessing layer turns raw strings into integer token IDs.
vocab_size = 10000  # keep only the 10,000 most frequent tokens to save memory
seq_length = 250    # pad or truncate every review to 250 tokens

# The layer maps text to fixed-length integer sequences.
# Its vocabulary is learned from the training data.
vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=seq_length,
)

# "Adapt" the layer so it learns the vocabulary from the training text only.
# Batching here only speeds up the pass over the data.
train_text = train_ds.map(lambda text, label: text).batch(1024)
vectorize_layer.adapt(train_text)

# 3. Preprocessing function for the pipeline.
def preprocess_text(text, label):
    text = vectorize_layer(text)  # string -> integer sequence using the learned vocabulary
    return text, label

# 4. Preprocess, shuffle, batch, and prefetch.
batch_size = 32

train_ds = (
    train_ds
    .shuffle(10000)  # buffer of 10,000; smaller than the 25,000-review split, so the shuffle is approximate
    .map(preprocess_text, num_parallel_calls=tf.data.AUTOTUNE)  # vectorize in parallel
    .batch(batch_size)
    .prefetch(tf.data.AUTOTUNE)  # let TensorFlow choose how many batches to prepare ahead
)

# The test set is not shuffled; evaluation order does not matter.
test_ds = (
    test_ds
    .map(preprocess_text, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(batch_size)
    .prefetch(tf.data.AUTOTUNE)
)

# 5. Build a simple model.
model = tf.keras.Sequential([
    # Map each token ID to a dense 64-dimensional vector.
    tf.keras.layers.Embedding(vocab_size, 64),
    # Average the 250 token vectors into one vector that represents the whole review.
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(64, activation="relu"),    # hidden layer
    tf.keras.layers.Dense(1, activation="sigmoid"),  # probability that the review is positive
])

# Compile the model.
model.compile(
    optimizer="adam",            # adaptive-learning-rate optimizer that applies the gradients
    loss="binary_crossentropy",  # standard loss for binary classification
    metrics=["accuracy"],
)

# 6. Train the model. The datasets are already batched and prefetched.
# The test split doubles as validation data here for simplicity.
history = model.fit(train_ds, validation_data=test_ds, epochs=3)
```

In this example we built a complete TensorFlow data pipeline for text using the IMDB reviews dataset. We loaded the data, vectorized the text with a TextVectorization layer, and then shuffled, batched, and prefetched it. Finally, we built and trained a simple neural network for sentiment analysis on the preprocessed data.

A data pipeline is the series of steps that prepares data for training: loading it, preprocessing it (tokenization and vectorization, for text), batching it into manageable sizes, and optimizing the data flow for performance. Here, the pipeline part was loading the dataset, applying the TextVectorization layer, and shuffling, batching, and prefetching the result. This ensures the data is in the right format and is fed efficiently into the model during training.


```text
Epoch 1/3
782/782 ━━━━━━━━━━━━━━━━━━━━ 13s 16ms/step - accuracy: 0.6535 - loss: 0.5926 - val_accuracy: 0.8550 - val_loss: 0.3496
Epoch 2/3
782/782 ━━━━━━━━━━━━━━━━━━━━ 12s 16ms/step - accuracy: 0.8736 - loss: 0.3008 - val_accuracy: 0.8558 - val_loss: 0.3340
Epoch 3/3
782/782 ━━━━━━━━━━━━━━━━━━━━ 12s 16ms/step - accuracy: 0.9040 - loss: 0.2412 - val_accuracy: 0.8660 - val_loss: 0.3235
```

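
A note on `shuffle(10000)` in the example: the 782 steps per epoch at batch size 32 correspond to the 25,000 training reviews, so a 10,000-element buffer cannot produce a fully uniform shuffle. The pure-Python sketch below imitates the buffered shuffling idea to show why; it is a conceptual model, not TensorFlow's implementation, and `buffered_shuffle` is a hypothetical helper.

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in a randomized order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill the buffer first
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random buffered item...
        buffer[idx] = item       # ...and put the new item in its slot
    rng.shuffle(buffer)          # drain what is left at the end
    yield from buffer


n, buffer_size = 20, 5
shuffled = list(buffered_shuffle(range(n), buffer_size))
print(shuffled)

# An item can only be emitted once it has entered the buffer, so the item at
# output position k always comes from the first k + buffer_size inputs.
max_lookahead = max(value - position for position, value in enumerate(shuffled))
print(max_lookahead <= buffer_size - 1)  # True
```

With a buffer of 1 the output order equals the input order, and with a buffer at least as large as the dataset the shuffle is uniform. Anything in between mixes only locally, which matters when the source data is sorted by label.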