<a href="https://colab.research.google.com/github/KrituneX/Hands-on-Machine-Learning-with-Scikit-Learn-Keras-TensorFlow/blob/main/Chapter_13.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Summary of Chapter 13: Loading and Preprocessing Data with TensorFlow
## Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd Ed.)

## 1. TensorFlow Data API: Basic Theory

### Data Pipeline Architecture
$$ \text{DataSource} \rightarrow \text{Transformations} \rightarrow \text{Consumer} $$

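The TensorFlow Data API uses lazy evaluation: transformations only describe the pipeline, and elements are computed when a consumer iterates over the dataset. With $f_1$ applied first:
$$ \text{Data Pipeline} = (f_n \circ \dots \circ f_2 \circ f_1)(\text{Dataset}) $$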

**Main Components**:
1. `Dataset`: an abstract representation of a stream of data elements
2. `Transformation`: an operation on a dataset (map, filter, batch, etc.) that returns a new dataset
3. `Iterator`: the mechanism that consumes the data
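
As a rough illustration of the lazy, composed pipeline described above, the following plain-Python sketch uses generators in place of `tf.data`. The names `source`, `square`, `batch`, and `calls` are hypothetical helpers for this example only, not part of TensorFlow. The point is that building the pipeline runs nothing; work happens only when the consumer pulls a batch.

```python
from itertools import islice

calls = []  # records which elements were actually processed


def source(n):
    """Data source: yields 0..n-1 one element at a time."""
    for i in range(n):
        yield i


def square(items):
    """Transformation: squares each element lazily."""
    for x in items:
        calls.append(x)
        yield x * x


def batch(items, size):
    """Transformation: groups elements into lists of at most `size`."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Building the pipeline does not process any element yet.
pipeline = batch(square(source(100)), size=4)
processed_before = len(calls)

# The consumer pulls one batch; only the 4 elements it needs are processed.
first = next(pipeline)
print(processed_before, first, calls)
```

Only four of the hundred source elements are touched, which is the same pull-based behavior that lets `tf.data` pipelines handle datasets that do not fit in memory.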

## 2. Basic Operations with the TF Data API

### Chaining Transformations
Each transformation returns a new dataset, so transformations compose by method chaining:
$$ \text{Dataset}' = \text{Dataset}.\text{map}(f).\text{batch}(n).\text{prefetch}(k) $$

**Example Implementation**:

In [None]:
import tensorflow as tf

# Build the data pipeline
dataset = tf.data.Dataset.range(100)
dataset = dataset.map(lambda x: x**2)  # Transformation 1: square each element
dataset = dataset.batch(16)            # Transformation 2: group into batches of 16
dataset = dataset.prefetch(1)          # Performance: prepare 1 batch ahead

for batch in dataset.take(3):
    print("Batch:", batch.numpy())

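## 3. Shuffling and Windowing

### Shuffling Theory
Buffer-based shuffling draws each output element at random from a buffer holding the next $b$ elements of the source:
$$ \text{Shuffle}(D, b) = \text{random_sample}(D, b) $$
where $b$ is the buffer size. The buffer is refilled from $D$ after each draw, so the result is a uniform shuffle only when $b \geq |D|$.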
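
To make the role of the buffer concrete, here is a minimal plain-Python sketch of buffer-based shuffling. The function `buffered_shuffle` is a hypothetical illustration of the idea, not TensorFlow's implementation, and it assumes `buffer_size >= 1`.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        # Buffer is full: emit a random element and replace it with the new one.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Source exhausted: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(20), buffer_size=5, seed=0))
unshuffled = list(buffered_shuffle(range(10), buffer_size=1))
print(shuffled)
print(unshuffled)
```

With `buffer_size=1` the order is unchanged, and with a small buffer an element can never appear more than `buffer_size - 1` positions earlier than its original position. This is why a buffer much smaller than the dataset gives only weak shuffling.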

### Window Transformation
For time series data, a window of size $w$ is taken every $s$ elements (the shift):
$$ \text{Window}(D, w, s) = \{D[i:i+w] \text{ for } i \in 0,s,2s,...\} $$
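
The window formula above can be sketched in plain Python on a list. The helper `windows` is hypothetical and only mirrors the `size`, `shift`, and `drop_remainder` parameters by name.

```python
def windows(data, size, shift, drop_remainder=True):
    """Return slices data[i:i+size] for i = 0, shift, 2*shift, ..."""
    result = []
    for start in range(0, len(data), shift):
        window = data[start:start + size]
        if drop_remainder and len(window) < size:
            break  # every later window would be short as well
        result.append(window)
    return result


full = windows(list(range(7)), size=3, shift=2)
with_tail = windows(list(range(7)), size=3, shift=2, drop_remainder=False)
print(full)       # [[0, 1, 2], [2, 3, 4], [4, 5, 6]]
print(with_tail)  # [[0, 1, 2], [2, 3, 4], [4, 5, 6], [6]]
```

A shift smaller than the window size produces overlapping windows, which is the usual setup for forecasting tasks.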

In [None]:
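# Example: window a time series, then shuffle the windows
dataset = tf.data.Dataset.range(100)
dataset = dataset.window(size=5, shift=1, drop_remainder=True)
# Each window is itself a Dataset; flatten it into a single tensor of 5 values
dataset = dataset.flat_map(lambda window: window.batch(5))
# Shuffle whole windows so the order inside each window is preserved
dataset = dataset.shuffle(buffer_size=50)

for window in dataset.take(3):
    print(window.numpy())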

## 4. Preprocessing Data

### Normalization
$$ z = \frac{x - \mu}{\sigma} $$

### One-Hot Encoding
$$ \text{encode}(x) = [\mathbb{I}(x=k)]_{k=0}^{K-1} $$
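
As a small worked example of both formulas, the sketch below standardizes a toy list of values and one-hot encodes a label using only the standard library. The data and the helper `one_hot` are made up for illustration; classes are indexed from 0 here, as in `tf.one_hot`.

```python
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mu = statistics.fmean(values)      # mean of the feature
sigma = statistics.pstdev(values)  # population standard deviation

# z = (x - mu) / sigma for every value
z = [(x - mu) / sigma for x in values]


def one_hot(label, num_classes):
    """Return a vector with 1.0 at position `label` and 0.0 elsewhere."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]


print(mu, sigma)      # 5.0 2.0
print(z)              # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
print(one_hot(2, 4))  # [0.0, 0.0, 1.0, 0.0]
```

In practice $\mu$ and $\sigma$ are usually computed once on the training set and reused for validation and test data.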

**Complete Pipeline**:

In [None]:
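def preprocess(features, label):
    # Cast the uint8 pixels to float32 so the mean and standard deviation
    # are computed in floating point
    features = tf.cast(features, tf.float32)

    # Normalization (per image): subtract the mean, divide by the standard deviation
    features = (features - tf.reduce_mean(features)) / tf.math.reduce_std(features)

    # One-hot encoding
    label = tf.one_hot(label, depth=10)

    return features, label

# Example: the MNIST dataset
(X_train, y_train), _ = tf.keras.datasets.mnist.load_data()
dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))
dataset = dataset.map(preprocess).batch(32).prefetch(1)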

for X_batch, y_batch in dataset.take(1):
    print("Batch shape:", X_batch.shape, y_batch.shape)

## 5. TFRecord Format

### TFRecord Structure
$$ \text{Example} = \{ \text{feature}: \text{Feature}\} $$

**Feature Types**:
1. FloatList
2. Int64List
3. BytesList

**Example: Creating a TFRecord**:

In [None]:
def write_tfrecord(images, labels, filename):
    with tf.io.TFRecordWriter(filename) as writer:
        for img, lbl in zip(images, labels):
            feature = {
                'image': tf.train.Feature(float_list=tf.train.FloatList(value=img.flatten())),
                # Convert the NumPy integer to a plain Python int for the protobuf field
                'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(lbl)]))
            }
            example = tf.train.Example(features=tf.train.Features(feature=feature))
            writer.write(example.SerializeToString())

# Example usage
write_tfrecord(X_train[:100], y_train[:100], 'mnist_sample.tfrecord')
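
Writing is only half of the round trip. The sketch below shows one way to read such records back with `tf.data.TFRecordDataset` and `tf.io.parse_single_example`. It is self-contained and uses a tiny made-up dataset of three 2x2 "images" in a temporary file; the file name `toy.tfrecord`, the function `parse`, and the dictionary `feature_description` are assumptions of this example.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Toy data: three 2x2 float images with integer labels (made up for this sketch).
images = np.arange(12, dtype=np.float32).reshape(3, 2, 2)
labels = [0, 1, 2]
path = os.path.join(tempfile.mkdtemp(), "toy.tfrecord")

# Write one serialized Example per (image, label) pair.
with tf.io.TFRecordWriter(path) as writer:
    for img, lbl in zip(images, labels):
        feature = {
            "image": tf.train.Feature(
                float_list=tf.train.FloatList(value=img.flatten().tolist())),
            "label": tf.train.Feature(
                int64_list=tf.train.Int64List(value=[int(lbl)])),
        }
        example = tf.train.Example(features=tf.train.Features(feature=feature))
        writer.write(example.SerializeToString())

# The parser must be told the shape and dtype of every stored feature.
feature_description = {
    "image": tf.io.FixedLenFeature([4], tf.float32),
    "label": tf.io.FixedLenFeature([], tf.int64),
}


def parse(serialized):
    """Decode one serialized Example back into an (image, label) pair."""
    parsed = tf.io.parse_single_example(serialized, feature_description)
    image = tf.reshape(parsed["image"], [2, 2])
    return image, parsed["label"]


dataset = tf.data.TFRecordDataset(path).map(parse)
records = [(img.numpy(), int(lbl)) for img, lbl in dataset]
print(len(records), records[1][1])
```

Because the image was flattened before writing, the original shape is not stored in the file and has to be restored by the reader, here with `tf.reshape`.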

## 6. Parallel Data Loading

### Interleave Pattern
$$ \text{interleave}(D_1,...,D_n) = \text{round_robin}(D_1,...,D_n) $$
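
The round-robin idea in the formula above can be sketched in plain Python. The generator `round_robin` is a hypothetical helper that takes one element from each source in turn and drops a source once it is exhausted. It is meant to mirror the simplest case of interleaving (one element per source per turn), not every option of `Dataset.interleave`.

```python
def round_robin(*sources):
    """Yield one element from each source in turn until all are exhausted."""
    iterators = [iter(s) for s in sources]
    while iterators:
        still_active = []
        for it in iterators:
            try:
                yield next(it)
            except StopIteration:
                continue  # this source is finished; drop it
            still_active.append(it)
        iterators = still_active


file_a = ["a1", "a2", "a3"]
file_b = ["b1", "b2"]
file_c = ["c1"]
merged = list(round_robin(file_a, file_b, file_c))
print(merged)  # ['a1', 'b1', 'c1', 'a2', 'b2', 'a3']
```

Mixing records from several files this way spreads each file's contents across the stream, which helps when individual files are internally ordered.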

**Example Implementation**:

In [None]:
files = ['data1.tfrecord', 'data2.tfrecord', 'data3.tfrecord']
dataset = tf.data.Dataset.from_tensor_slices(files)
dataset = dataset.interleave(
    lambda x: tf.data.TFRecordDataset(x),
    cycle_length=4,
    num_parallel_calls=tf.data.AUTOTUNE
)

## 7. Best Practices

1. **Pipeline Optimization**:
$$ \text{Throughput} = \frac{\text{Batch Size}}{\text{Step Time}} $$

2. **Cache Strategy**:
$$ \text{Dataset} = \text{Dataset}.\text{cache}() $$

3. **Prefetch Pattern**:
$$ \text{Dataset} = \text{Dataset}.\text{prefetch}(\text{buffer_size}=\text{tf.data.AUTOTUNE}) $$
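
The three practices above can be combined in one small pipeline. The following is a minimal sketch on a toy range dataset, with the throughput formula applied to a wall-clock measurement. The variable names and the toy sizes are assumptions of this example, and the measured number will vary from machine to machine.

```python
import time

import tensorflow as tf

batch_size = 4
dataset = (
    tf.data.Dataset.range(10)
    .map(lambda x: x * 2, num_parallel_calls=tf.data.AUTOTUNE)
    .cache()                     # keep mapped elements in memory after the first pass
    .batch(batch_size)
    .prefetch(tf.data.AUTOTUNE)  # overlap data preparation with consumption
)

start = time.perf_counter()
batches = [batch.numpy().tolist() for batch in dataset]
elapsed = time.perf_counter() - start

# Throughput = batch size / step time, using the average time per batch.
step_time = elapsed / len(batches)
throughput = batch_size / step_time
print(batches)
print(f"approx. throughput: {throughput:.0f} elements/s")
```

`cache()` is placed after the `map` so the mapped values are reused in later epochs, and before any shuffling step so that a fresh order can still be drawn each epoch.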