# 📊 Chapter 13: Loading and Preprocessing Data with TensorFlow

---

## 🎯 Learning Objectives

After completing this chapter, you will be able to:
- ✅ Understand the concepts behind the TensorFlow Data API
- ✅ Create and manipulate datasets with tf.data
- ✅ Load data from various sources (CSV, TFRecord, etc.)
- ✅ Preprocess data efficiently
- ✅ Optimize data pipeline performance
- ✅ Integrate data pipelines with Keras models

---

## 📋 Chapter Outline

1. **Data API - Basic Concepts** 🔰
2. **Loading Data from Files** 📁
3. **Data Preprocessing & Feature Engineering** 🔧
4. **Performance Optimization** 🚀
5. **TFRecord Format** 💾
6. **Integration with Keras** 🤝
7. **Best Practices & Tips** 💡

---

## 🌟 Introduction

This chapter covers how to load and preprocess data efficiently for Deep Learning systems. So far we have only used datasets that fit in memory, but Deep Learning systems are often trained on very large datasets that do not fit in RAM.

The **TensorFlow Data API** provides a solution for:
- 📊 Loading large datasets efficiently
- ⚡ Processing data with multithreading, queuing, batching, and prefetching
- 🔗 Seamless integration with tf.keras
- 🎯 Scalable, reproducible data pipelines

**Supported data sources:**
- 📄 Text files (CSV, JSON)
- 🗂️ Binary files with fixed-size records
- 💾 TFRecord format (variable-size records)
- 🗃️ SQL databases
- 🌐 Various other data sources through extensions

In [1]:
# 🔧 Setup & Import Libraries
print("=" * 60)
print("🚀 CHAPTER 13: Loading and Preprocessing Data with TensorFlow")
print("=" * 60)

import tensorflow as tf
import numpy as np
import pandas as pd
from tensorflow import keras
import matplotlib.pyplot as plt
import os
import tempfile
import time
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
tf.random.set_seed(42)
np.random.seed(42)

# Display versions
print(f"📦 TensorFlow version: {tf.__version__}")
print(f"📦 NumPy version: {np.__version__}")
print(f"📦 Pandas version: {pd.__version__}")

# Check GPU availability
if tf.config.list_physical_devices('GPU'):
    print("🎮 GPU Available:", tf.config.list_physical_devices('GPU'))
else:
    print("💻 Running on CPU")

print("\n✅ Setup complete! Ready to explore TensorFlow Data API")
print("=" * 60)

🚀 CHAPTER 13: Loading and Preprocessing Data with TensorFlow
📦 TensorFlow version: 2.19.0
📦 NumPy version: 2.1.3
📦 Pandas version: 2.3.0
💻 Running on CPU

✅ Setup complete! Ready to explore TensorFlow Data API


---

# 🔰 1. Data API - Basic Concepts

## 📚 Introduction to tf.data

The **tf.data API** is the core of data loading in TensorFlow. It provides:

- 🔄 **Dataset**: An abstraction for a sequence of elements
- ⚡ **Transformations**: Map, filter, batch, shuffle, etc.
- 🚀 **Performance**: Prefetching, caching, parallelization
- 🔗 **Integration**: Seamless with tf.keras

---

## 1.1 Creating a Simple Dataset 📊

The Data API revolves around the concept of a **Dataset**, which represents a sequence of data items. A dataset usually reads data from disk incrementally, but for simplicity we start with a dataset held in RAM, created with `tf.data.Dataset.from_tensor_slices()`.

### Dataset Creation Methods:
- `tf.data.Dataset.from_tensor_slices()` - one element per slice along the first axis of an array/tensor
- `tf.data.Dataset.from_tensors()` - a single element containing the whole tensor
- `tf.data.Dataset.range()` - a range of integer values
- `tf.data.Dataset.from_generator()` - elements produced by a Python generator function
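
Of the creation methods listed above, `from_generator()` is the only one that needs an explicit description of its elements. The following is a minimal sketch, assuming a hypothetical generator named `count_squares` (not part of the original notebook) that yields `(x, x²)` pairs; the `output_signature` argument declares the dtype and shape of each yielded item.

```python
import tensorflow as tf

def count_squares(n=5):
    """Hypothetical generator: yields (x, x**2) pairs one at a time."""
    for x in range(n):
        yield x, x * x

# output_signature tells tf.data the structure of every yielded element.
gen_dataset = tf.data.Dataset.from_generator(
    count_squares,
    output_signature=(
        tf.TensorSpec(shape=(), dtype=tf.int32),
        tf.TensorSpec(shape=(), dtype=tf.int32),
    ),
)

pairs = [(int(x), int(y)) for x, y in gen_dataset.as_numpy_iterator()]
print(pairs)  # [(0, 0), (1, 1), (2, 4), (3, 9), (4, 16)]
```

Because the generator runs as ordinary Python, this approach is convenient for data that cannot be expressed as tensors up front, but it is usually slower than the tensor-based constructors.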

In [2]:
# 📊 1.1 Creating a Simple Dataset from Arrays
print("🔰 BASIC DATASET CREATION")
print("=" * 50)

# 1. Dataset from tensor slices (most common)
print("\n📋 1. Dataset from tensor slices:")
X = tf.range(10)
dataset = tf.data.Dataset.from_tensor_slices(X)
print(f"Dataset from range(10): {list(dataset.as_numpy_iterator())}")

# 2. Dataset from multiple arrays
print("\n📋 2. Dataset from multiple arrays:")
X = tf.range(5)
Y = tf.range(10, 15)
dataset_xy = tf.data.Dataset.from_tensor_slices((X, Y))
print("Dataset from (X, Y):")
for x, y in dataset_xy:
    print(f"  X: {x.numpy()}, Y: {y.numpy()}")

# 3. Dataset from a dictionary (very useful!)
print("\n📋 3. Dataset from a dictionary:")
dataset_dict = tf.data.Dataset.from_tensor_slices({
    "features": tf.random.normal((5, 3)),
    "labels": tf.range(5)
})
print("Dataset dictionary:")
for i, item in enumerate(dataset_dict):
    print(f"  Sample {i+1}:")
    print(f"    Features shape: {item['features'].shape}")
    print(f"    Label: {item['labels'].numpy()}")

# 4. Dataset from a single tensor
print("\n📋 4. Dataset from a single tensor:")
tensor_data = tf.constant([[1, 2], [3, 4], [5, 6]])
dataset_tensor = tf.data.Dataset.from_tensors(tensor_data)
print("Dataset from a single tensor:")
for item in dataset_tensor:
    print(f"  Shape: {item.shape}, Values:\n{item.numpy()}")

# 5. Dataset range
print("\n📋 5. Dataset range:")
range_dataset = tf.data.Dataset.range(5)
print(f"Range dataset: {list(range_dataset.as_numpy_iterator())}")

print("\n✅ Basic dataset creation complete!")
print("=" * 50)

🔰 BASIC DATASET CREATION

📋 1. Dataset from tensor slices:
Dataset from range(10): [np.int32(0), np.int32(1), np.int32(2), np.int32(3), np.int32(4), np.int32(5), np.int32(6), np.int32(7), np.int32(8), np.int32(9)]

📋 2. Dataset from multiple arrays:
Dataset from (X, Y):
  X: 0, Y: 10
  X: 1, Y: 11
  X: 2, Y: 12
  X: 3, Y: 13
  X: 4, Y: 14

📋 3. Dataset from a dictionary:
Dataset dictionary:
  Sample 1:
    Features shape: (3,)
    Label: 0
  Sample 2:
    Features shape: (3,)
    Label: 1
  Sample 3:
    Features shape: (3,)
    Label: 2
  Sample 4:
    Features shape: (3,)
    Label: 3
  Sample 5:
    Features shape: (3,)
    Label: 4

📋 4. Dataset from a single tensor:
Dataset from a single tensor:
  Shape: (3, 2), Values:
[[1 2]
 [3 4]
 [5 6]]

📋 5. Dataset range:
Range dataset: [np.int64(0), np.int64(1), np.int64(2), np.int64(3), np.int64(4)]

✅ Basic dataset creation complete!


---

## 1.2 Dataset Transformations 🔄

A dataset has many **transformation methods** that can be chained to build a powerful data pipeline:

### 🛠️ Core Transformations:
- **`map(func)`** - Applies a function to every element
- **`filter(predicate)`** - Keeps only the elements that satisfy a condition
- **`batch(batch_size)`** - Groups elements into batches
- **`shuffle(buffer_size)`** - Randomizes element order using a buffer of `buffer_size` elements
- **`repeat(count)`** - Repeats the dataset `count` times (indefinitely if omitted)
- **`take(count)`** - Takes the first `count` elements
- **`skip(count)`** - Skips the first `count` elements
- **`cache(filename)`** - Caches elements in memory, or on disk if `filename` is given
- **`prefetch(buffer_size)`** - Prepares upcoming elements in the background

### 🔗 Method Chaining:
Transformations can be chained through a fluent interface:
```python
dataset = (tf.data.Dataset.from_tensor_slices(data)
           .map(preprocess_func)
           .filter(lambda x: x > 0)
           .shuffle(1000)
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))
```
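
The chaining snippet above does not exercise `repeat()`, `skip()`, or `take()`. The sketch below shows them on a small range dataset, together with the `drop_remainder` argument of `batch()`; the variable names are illustrative only.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(7)

# repeat(2) yields all elements twice; skip/take slice the stream.
repeated = [int(v) for v in ds.repeat(2).as_numpy_iterator()]
window = [int(v) for v in ds.skip(2).take(3).as_numpy_iterator()]

# drop_remainder=True discards the final, smaller batch so that every
# batch has the same static shape.
full_batches = [b.tolist() for b in ds.batch(3, drop_remainder=True).as_numpy_iterator()]

print(len(repeated))   # 14
print(window)          # [2, 3, 4]
print(full_batches)    # [[0, 1, 2], [3, 4, 5]]
```

Each call returns a new dataset and leaves `ds` unchanged, which is why the three derived datasets can all start from the same source.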

In [None]:
# 🔄 1.2 Dataset Transformation Demo
print("🔄 DATASET TRANSFORMATIONS")
print("=" * 50)

# Base data for the demo
base_data = tf.range(12)
# A tf.Tensor has no as_numpy_iterator(); convert it to a Python list once
base_list = base_data.numpy().tolist()
print(f"📊 Base dataset: {base_list}")

print("\n🛠️ Individual Transformations:")
print("-" * 30)

# 1. MAP - Transform every element
print("\n1️⃣ MAP Transformation:")
mapped = tf.data.Dataset.from_tensor_slices(base_data).map(lambda x: x ** 2)
print(f"   Original: {base_list}")
print(f"   Squared:  {list(mapped.as_numpy_iterator())}")

# 2. FILTER - Keep only matching elements
print("\n2️⃣ FILTER Transformation:")
filtered = tf.data.Dataset.from_tensor_slices(base_data).filter(lambda x: x % 3 == 0)
print(f"   Original: {base_list}")
print(f"   Divisible by 3: {list(filtered.as_numpy_iterator())}")

# 3. BATCH - Group elements into batches
print("\n3️⃣ BATCH Transformation:")
batched = tf.data.Dataset.from_tensor_slices(base_data).batch(4)
print("   Batched data:")
for i, batch in enumerate(batched):
    print(f"     Batch {i+1}: {batch.numpy()}")

# 4. SHUFFLE - Randomize the order
print("\n4️⃣ SHUFFLE Transformation:")
shuffled = (tf.data.Dataset.from_tensor_slices(base_data)
            .shuffle(buffer_size=12, seed=42))
print(f"   Original: {base_list}")
print(f"   Shuffled: {list(shuffled.as_numpy_iterator())}")

print("\n🔗 CHAINED TRANSFORMATIONS:")
print("-" * 30)

# Complex pipeline built by chaining
result = (tf.data.Dataset.from_tensor_slices(base_data)
          .filter(lambda x: x < 10)          # Filter: x < 10
          .map(lambda x: x * 2)              # Map: multiply by 2
          .shuffle(buffer_size=20, seed=42)   # Shuffle
          .batch(3)                          # Batch size 3
          .take(2))                          # Take first 2 batches

print("Pipeline: filter(x<10) → map(x*2) → shuffle → batch(3) → take(2)")
for i, batch in enumerate(result):
    print(f"   Batch {i+1}: {batch.numpy()}")

print("\n✅ Dataset transformations complete!")
print("=" * 50)

---

# 📁 2. Loading Data from Files

## 🗂️ Supported Data Sources

The TensorFlow Data API supports a range of file formats:

### 📊 Structured Data:
- **CSV Files** - `tf.data.experimental.make_csv_dataset()`
- **JSON Files** - Custom parsing with `TextLineDataset`
- **Parquet Files** - Via TensorFlow I/O

### 💾 Binary Data:
- **TFRecord** - `tf.data.TFRecordDataset()` (TensorFlow's native format)
- **Fixed-length records** - `tf.data.FixedLengthRecordDataset()`
- **Raw binary** - `tf.io.read_file()` + `tf.io.decode_raw()`

### 📄 Text & Image Data:
- **Text files** - `tf.data.TextLineDataset()`
- **Image files** - `tf.data.Dataset.list_files()` + `tf.io.read_file()`

---

## 2.1 CSV Files 📊

CSV is the most common format for structured data. TensorFlow offers two approaches:
1. **High-level**: `make_csv_dataset()` - automatic parsing
2. **Low-level**: `TextLineDataset()` - manual control
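
Large datasets are often split across several CSV files. As a sketch of the low-level approach applied to that case, the example below writes two tiny shards and reads them with `list_files()` and `interleave()`. The file names (`part_0.csv`, `part_1.csv`), the column layout, and the helper `parse_line` are all hypothetical.

```python
import os
import tempfile
import tensorflow as tf

# Write two tiny CSV shards (hypothetical file names and contents).
tmp_dir = tempfile.mkdtemp()
shards = [["1.0,10.0", "2.0,20.0"], ["3.0,30.0", "4.0,40.0"]]
for shard_id, shard_rows in enumerate(shards):
    with open(os.path.join(tmp_dir, f"part_{shard_id}.csv"), "w") as f:
        f.write("x,y\n" + "\n".join(shard_rows) + "\n")

def parse_line(line):
    # Two float columns: the first is the feature, the second the label.
    x, y = tf.io.decode_csv(line, record_defaults=[0.0, 0.0])
    return x, y

# shuffle=False keeps the file order deterministic for this demo.
files = tf.data.Dataset.list_files(os.path.join(tmp_dir, "part_*.csv"), shuffle=False)

# interleave() opens `cycle_length` files at once and alternates between them.
lines = files.interleave(
    lambda path: tf.data.TextLineDataset(path).skip(1),  # skip each header row
    cycle_length=2,
)
rows = [(float(x), float(y)) for x, y in lines.map(parse_line).as_numpy_iterator()]
print(rows)
```

Interleaving mixes records from different files, which helps when each file was written in some sorted order; adding `num_parallel_calls=tf.data.AUTOTUNE` to `interleave()` would additionally read the files in parallel.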

In [None]:
# 📊 2.1 Loading CSV Data - Practical Examples
print("📁 LOADING CSV DATA")
print("=" * 50)

# Create sample CSV data
csv_content = """longitude,latitude,housing_median_age,total_rooms,population,median_income,price
-122.23,37.88,41.0,880.0,322.0,8.3252,452600.0
-122.22,37.86,21.0,1106.0,2401.0,8.3014,358500.0
-122.24,37.85,52.0,1467.0,496.0,7.2574,352100.0
-122.25,37.85,52.0,1274.0,558.0,5.6431,341300.0
-122.25,37.85,52.0,1627.0,565.0,3.8462,342200.0"""

# Write to temporary file
temp_file = tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False)
temp_file.write(csv_content)
temp_file.close()
print(f"📝 Created sample CSV: {os.path.basename(temp_file.name)}")

print("\n🎯 METHOD 1: make_csv_dataset (Recommended)")
print("-" * 40)

# High-level CSV loading
csv_dataset = tf.data.experimental.make_csv_dataset(
    temp_file.name,
    batch_size=2,
    label_name="price",           # Target column
    na_value="?",                 # Missing value indicator
    num_epochs=1,                 # Number of epochs
    ignore_errors=True,           # Skip problematic rows
    shuffle=False                 # Keep order for demo
)

print("✅ Dataset created with automatic type inference")
print("📋 Sample data:")
for batch_num, (features, labels) in enumerate(csv_dataset.take(2)):
    print(f"\n   Batch {batch_num + 1}:")
    print(f"   Features: {list(features.keys())}")
    for key, values in features.items():
        print(f"     {key}: {values.numpy()}")
    print(f"   Labels (price): {labels.numpy()}")

print("\n🔧 METHOD 2: TextLineDataset (Manual Control)")
print("-" * 40)

# Low-level manual parsing
def parse_csv_line(line):
    """Parse a single CSV line"""
    # Define default values (for type inference)
    defaults = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # All float
    fields = tf.io.decode_csv(line, defaults)
    
    # Features (all except last column)
    features = tf.stack(fields[:-1])
    # Label (last column) 
    label = fields[-1]
    
    return features, label

# Create dataset from text lines
text_dataset = tf.data.TextLineDataset(temp_file.name)
text_dataset = text_dataset.skip(1)  # Skip header
parsed_dataset = text_dataset.map(parse_csv_line)

print("✅ Manual parsing complete")
print("📋 Parsed data samples:")
for i, (features, label) in enumerate(parsed_dataset.take(2)):
    print(f"   Sample {i+1}: Features={features.numpy()}, Price={label.numpy()}")

print("\n🔄 PIPELINE-READY DATASET:")
print("-" * 25)

# Create training-ready pipeline
training_dataset = (parsed_dataset
                   .shuffle(buffer_size=100, seed=42)
                   .batch(2)
                   .prefetch(tf.data.AUTOTUNE))

print("✅ Training pipeline: shuffle → batch → prefetch")
for batch in training_dataset.take(1):
    features, labels = batch
    print(f"   Batch shape: Features {features.shape}, Labels {labels.shape}")

# Cleanup
os.unlink(temp_file.name)
print("\n🧹 Cleaned up temporary file")
print("=" * 50)

# 🔧 3. Data Preprocessing & Feature Engineering

## 🎯 Why Does Preprocessing Matter?

Deep learning models need data that is **standardized** and **clean**:

### 📊 Numerical Data:
- **Normalization** - Scale data to the range [0,1]: `(x - min) / (max - min)`
- **Standardization** - Zero mean, unit variance: `(x - mean) / std`
- **Robust scaling** - Uses the median and IQR

### 📝 Text Data:
- **Tokenization** - Split text into tokens
- **Vocabulary mapping** - Convert tokens to integers
- **Padding** - Uniform sequence length
- **Embedding** - Dense vector representation

### 🖼️ Image Data:
- **Normalization** - Pixel values in [0,1] or [-1,1]
- **Resize** - Uniform image dimensions
- **Augmentation** - Rotation, flip, crop, etc.
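
The image steps listed above can be sketched with `tf.image`. The example uses a random tensor as a stand-in for a decoded image, and the helper name `preprocess_image` and the 32x32 target size are assumptions made for illustration.

```python
import tensorflow as tf

# Hypothetical stand-in for a decoded image: 64x48 RGB with uint8 pixels.
image = tf.cast(
    tf.random.uniform((64, 48, 3), maxval=256, dtype=tf.int32), tf.uint8
)

def preprocess_image(img, size=(32, 32)):
    img = tf.image.resize(img, size)             # uniform dimensions, float32 output
    img = img / 255.0                            # scale pixel values to [0, 1]
    img = tf.image.random_flip_left_right(img)   # simple augmentation
    return img

processed = preprocess_image(image)
print(processed.shape, processed.dtype)
```

In a real pipeline this function would typically be passed to `dataset.map()` after reading and decoding each file, with random augmentation applied only to the training split.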

---

## 3.1 Numerical Preprocessing 🔢

Preprocessing techniques for numerical data:
- **Normalization**: Rescales the data to the range [0,1]
- **Standardization**: Transforms the data to have mean=0 and std=1
- **Robust scaling**: Uses the median and IQR
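
Robust scaling is the one technique in this list that the notebook's code cells do not demonstrate. A minimal NumPy sketch, assuming a made-up feature column with one extreme outlier:

```python
import numpy as np

# Hypothetical feature column with one extreme outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

# Robust scaling: center on the median, divide by the interquartile range.
robust = (x - median) / iqr
print(median, iqr)   # 3.0 2.0
print(robust)        # [-1.  -0.5  0.   0.5 48.5]
```

The four ordinary values land in [-1, 0.5] regardless of the outlier, whereas min-max scaling on the same column would compress them into roughly [0, 0.03].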

## 3.2 Text Preprocessing 📚

For text data, we need:
- **Tokenization**: Splitting text into tokens
- **Vocabulary mapping**: Converting tokens to integers
- **Padding**: Making all sequences the same length
- **Embedding**: A dense vector representation

In [None]:
# 🔧 Data Preprocessing Examples
print("=== DATA PREPROCESSING ===")

# Sample data
data = tf.constant([
    [1.0, 100.0, 0.5],  
    [2.0, 200.0, 1.5],
    [3.0, 150.0, 2.0],
    [4.0, 50.0, 0.8]
])

print("Original data:")
print(data.numpy())

# 1. Normalization (Min-Max Scaling)
print("\n=== NORMALIZATION (Min-Max) ===")
def normalize_minmax(data):
    min_vals = tf.reduce_min(data, axis=0)
    max_vals = tf.reduce_max(data, axis=0)
    return (data - min_vals) / (max_vals - min_vals)

normalized_data = normalize_minmax(data)
print("Normalized data (0-1 range):")
print(normalized_data.numpy())

# 2. Standardization (Z-score)
print("\n=== STANDARDIZATION (Z-score) ===")
def standardize(data):
    mean = tf.reduce_mean(data, axis=0)
    std = tf.math.reduce_std(data, axis=0)
    return (data - mean) / std

standardized_data = standardize(data)
print("Standardized data (mean=0, std=1):")
print(standardized_data.numpy())
print("Mean:", tf.reduce_mean(standardized_data, axis=0).numpy())
print("Std:", tf.math.reduce_std(standardized_data, axis=0).numpy())

# 3. Feature Engineering - Polynomial Features
print("\n=== FEATURE ENGINEERING ===")
def create_polynomial_features(data):
    # Create interaction features
    x1, x2, x3 = tf.split(data, 3, axis=1)
    
    # Original features + polynomial features
    features = tf.concat([
        data,                    # Original features
        x1 * x2,                # Interaction x1*x2
        x1 * x3,                # Interaction x1*x3  
        x2 * x3,                # Interaction x2*x3
        tf.square(x1),          # x1²
        tf.square(x2),          # x2²
        tf.square(x3)           # x3²
    ], axis=1)
    
    return features

poly_features = create_polynomial_features(data)
print("Original + Polynomial features:")
print(f"Shape: {data.shape} → {poly_features.shape}")
print("First sample:")
print(f"  Original: {data[0].numpy()}")
print(f"  Enhanced: {poly_features[0].numpy()}")

# 🔢 3.1 Numerical Data Preprocessing
print("🔧 NUMERICAL PREPROCESSING")
print("=" * 50)

# Sample numerical data (different scales)
raw_data = tf.constant([
    [1.0, 100.0, 0.5, 1000.0],    # Mixed scales
    [2.0, 200.0, 1.5, 2000.0],
    [3.0, 150.0, 2.0, 1500.0],
    [4.0, 50.0, 0.8, 800.0],
    [5.0, 300.0, 1.2, 3000.0]
], dtype=tf.float32)

print("📊 Original data (mixed scales):")
print(raw_data.numpy())
print(f"   Min values: {tf.reduce_min(raw_data, axis=0).numpy()}")
print(f"   Max values: {tf.reduce_max(raw_data, axis=0).numpy()}")

print("\n1️⃣ MIN-MAX NORMALIZATION [0,1]")
print("-" * 35)

def normalize_minmax(data):
    """Min-Max normalization to [0,1] range"""
    min_vals = tf.reduce_min(data, axis=0)
    max_vals = tf.reduce_max(data, axis=0)
    # Avoid division by zero
    range_vals = tf.maximum(max_vals - min_vals, 1e-8)
    return (data - min_vals) / range_vals

normalized_data = normalize_minmax(raw_data)
print("✅ Normalized data [0,1]:")
print(normalized_data.numpy())
print(f"   New min: {tf.reduce_min(normalized_data, axis=0).numpy()}")
print(f"   New max: {tf.reduce_max(normalized_data, axis=0).numpy()}")

print("\n2️⃣ Z-SCORE STANDARDIZATION")
print("-" * 30)

def standardize_zscore(data):
    """Z-score standardization (mean=0, std=1)"""
    mean = tf.reduce_mean(data, axis=0)
    std = tf.math.reduce_std(data, axis=0)
    # Avoid division by zero
    std = tf.maximum(std, 1e-8)
    return (data - mean) / std

standardized_data = standardize_zscore(raw_data)
print("✅ Standardized data (μ=0, σ=1):")
print(standardized_data.numpy())
print(f"   New mean: {tf.reduce_mean(standardized_data, axis=0).numpy()}")
print(f"   New std:  {tf.math.reduce_std(standardized_data, axis=0).numpy()}")

print("\n3️⃣ FEATURE ENGINEERING")
print("-" * 25)

def create_polynomial_features(data):
    """Create polynomial and interaction features"""
    # Original features
    original_features = data
    
    # Polynomial features (squares)
    squared_features = tf.square(data)
    
    # Interaction features (pairwise products)
    # For demo, just first two columns
    interaction = tf.expand_dims(data[:, 0] * data[:, 1], axis=1)
    
    # Combine all features
    enhanced_features = tf.concat([
        original_features,      # Original
        squared_features,       # x²
        interaction            # x₁ × x₂
    ], axis=1)
    
    return enhanced_features

enhanced_data = create_polynomial_features(raw_data)
print("✅ Enhanced features:")
print(f"   Original shape: {raw_data.shape}")
print(f"   Enhanced shape: {enhanced_data.shape}")
print("   First sample enhanced:")
print(f"     Original: {raw_data[0].numpy()}")
print(f"     Enhanced: {enhanced_data[0].numpy()}")

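print("\n🔄 DATASET INTEGRATION")
print("-" * 22)

# Statistics must come from the whole dataset: inside dataset.map() each
# element is a single sample, so calling standardize_zscore() there would
# normalize across that sample's features instead of across samples.
feature_mean = tf.reduce_mean(raw_data, axis=0)
feature_std = tf.maximum(tf.math.reduce_std(raw_data, axis=0), 1e-8)

def preprocess_fn(sample):
    """Standardize one sample with the precomputed per-feature statistics"""
    return (sample - feature_mean) / feature_std

# Create dataset and apply preprocessing
dataset = tf.data.Dataset.from_tensor_slices(raw_data)
preprocessed_dataset = dataset.map(preprocess_fn)

print("✅ Preprocessing applied to dataset:")
for i, (original, processed) in enumerate(zip(dataset.take(2), preprocessed_dataset.take(2))):
    print(f"   Sample {i+1}:")
    print(f"     Before: {original.numpy()}")
    print(f"     After:  {processed.numpy()}")

print("\n✅ Numerical preprocessing complete!")
print("=" * 50)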

In [None]:
# 📝 3.2 Text Data Preprocessing
print("📝 TEXT PREPROCESSING")
print("=" * 50)

# Sample text data
texts = [
    "I love machine learning and deep learning",
    "TensorFlow is an amazing framework", 
    "Natural language processing is fascinating",
    "Deep neural networks are powerful",
    "Data science and AI are the future"
]

print("📋 Original texts:")
for i, text in enumerate(texts, 1):
    print(f"   {i}. {text}")

print("\n1️⃣ TOKENIZATION & VOCABULARY")
print("-" * 35)

# Create tokenizer (legacy API, deprecated in favor of TextVectorization; see step 5)
tokenizer = tf.keras.preprocessing.text.Tokenizer(
    num_words=50,           # Vocabulary size
    oov_token="<OOV>",      # Out-of-vocabulary token
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'  # Characters to filter
)

# Fit on texts
tokenizer.fit_on_texts(texts)

print(f"✅ Vocabulary size: {len(tokenizer.word_index)}")
print("📚 Top 10 words in vocabulary:")
for word, idx in list(tokenizer.word_index.items())[:10]:
    print(f"     '{word}': {idx}")

print("\n2️⃣ TEXT TO SEQUENCES")
print("-" * 22)

# Convert texts to sequences
sequences = tokenizer.texts_to_sequences(texts)
print("✅ Text → Sequences conversion:")
for i, (text, seq) in enumerate(zip(texts, sequences)):
    print(f"   {i+1}. '{text[:30]}...' → {seq}")

print("\n3️⃣ SEQUENCE PADDING")
print("-" * 20)

# Pad sequences to uniform length
max_length = 8
padded_sequences = tf.keras.preprocessing.sequence.pad_sequences(
    sequences, 
    maxlen=max_length, 
    padding='post',      # Pad at the end
    truncating='post'    # Truncate at the end
)

print(f"✅ Padded sequences (max_length={max_length}):")
for i, (original, padded) in enumerate(zip(sequences, padded_sequences)):
    print(f"   {i+1}. {original} → {padded}")

print("\n4️⃣ TENSORFLOW DATASET INTEGRATION")
print("-" * 35)

# Create TensorFlow dataset
text_dataset = tf.data.Dataset.from_tensor_slices(texts)
padded_dataset = tf.data.Dataset.from_tensor_slices(padded_sequences)

# Combine text and sequences
combined_dataset = tf.data.Dataset.zip((text_dataset, padded_dataset))

print("✅ Text dataset created:")
for text, sequence in combined_dataset.take(2):
    print(f"   Text: {text.numpy().decode('utf-8')}")
    print(f"   Sequence: {sequence.numpy()}")
    print()

print("\n5️⃣ MODERN APPROACH: TextVectorization")
print("-" * 35)

# TensorFlow 2.x way (more efficient)
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=50,
    output_sequence_length=max_length,
    output_mode='int'
)

# Adapt to text data
vectorizer.adapt(texts)

print("✅ TextVectorization layer created")
print(f"📚 Vocabulary size: {vectorizer.vocabulary_size()}")

# Apply vectorization
vectorized_texts = vectorizer(texts)
print("📋 Vectorized texts:")
for i, vec in enumerate(vectorized_texts.numpy()):
    print(f"   {i+1}. {vec}")

print("\n🔄 COMPLETE TEXT PIPELINE")
print("-" * 25)

# Create complete preprocessing pipeline
def text_preprocessing_pipeline(text_data, max_tokens=50, max_length=8):
    """Complete text preprocessing pipeline"""
    
    # Create and adapt vectorizer
    vectorizer = tf.keras.layers.TextVectorization(
        max_tokens=max_tokens,
        output_sequence_length=max_length,
        output_mode='int'
    )
    vectorizer.adapt(text_data)
    
    # Create dataset
    dataset = tf.data.Dataset.from_tensor_slices(text_data)
    
    # Apply vectorization
    vectorized_dataset = dataset.map(
        lambda x: vectorizer(x),
        num_parallel_calls=tf.data.AUTOTUNE
    )
    
    return vectorized_dataset, vectorizer

# Apply complete pipeline
processed_dataset, text_vectorizer = text_preprocessing_pipeline(texts)

print("✅ Complete pipeline applied:")
for i, processed_text in enumerate(processed_dataset.take(2)):
    print(f"   Sample {i+1}: {processed_text.numpy()}")

print("\n✅ Text preprocessing complete!")
print("=" * 50)

# 🚀 4. Performance Optimization

## ⚡ Why Does Optimization Matter?

Data loading is often the **bottleneck** in deep learning training. Without proper optimization:
- 🐌 The model waits for data (the GPU sits idle)
- 💰 Compute resources are wasted
- ⏱️ Training takes much longer

## 🛠️ Main Optimization Techniques

### 1. **Prefetching** 🔄
Load the next batches in the background while the model trains on the current one
```python
dataset = dataset.prefetch(tf.data.AUTOTUNE)
```

### 2. **Caching** 💾
Store preprocessed data in memory or on disk
```python
dataset = dataset.cache()  # Memory cache
dataset = dataset.cache('/path/to/cache')  # Disk cache
```

### 3. **Parallelization** 🔀
Use multiple CPU cores for preprocessing
```python
dataset = dataset.map(preprocess_fn, num_parallel_calls=tf.data.AUTOTUNE)
```

### 4. **Vectorization** 📊
Batched operations are more efficient than single-item operations
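
One way to apply vectorization in a tf.data pipeline is to batch before mapping, so the mapped function runs once per batch instead of once per element. A minimal sketch, assuming a hypothetical elementwise function `scale`:

```python
import tensorflow as tf

def scale(x):
    # Works elementwise, so it accepts a scalar or a whole batch.
    return tf.cast(x, tf.float32) / 10.0

# Per-element: `scale` is applied to each of the 8 elements separately.
per_element = tf.data.Dataset.range(8).map(scale).batch(4)

# Vectorized: batch first, so `scale` is applied once per batch of 4.
vectorized = tf.data.Dataset.range(8).batch(4).map(scale)

per_element_batches = [b.tolist() for b in per_element.as_numpy_iterator()]
vectorized_batches = [b.tolist() for b in vectorized.as_numpy_iterator()]
print(per_element_batches == vectorized_batches)  # True
```

Both orderings produce the same batches here; the vectorized form simply pays the per-call overhead of `map()` fewer times. It only works when the mapped function handles a leading batch dimension correctly.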

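---

## 4.1 The Right Order of Optimizations ✅

**Recommended Order:**
1. `map()` (preprocessing, with `num_parallel_calls`)
2. `cache()` (if the data fits in memory)
3. `shuffle()` (after `cache()`, so every epoch gets a new order)
4. `batch()`
5. `prefetch()`

**❌ Avoid:** shuffling after batching (only whole batches get reordered), caching before expensive operations (they are recomputed every epoch), and caching after `shuffle()` (the first shuffled order is replayed every epoch)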
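
The relative position of `cache()` and `shuffle()` changes what each epoch sees. The sketch below compares both orderings on a small range dataset; the helper `epochs` is hypothetical and exists only for this illustration.

```python
import tensorflow as tf

def epochs(ds, n=2):
    """Iterate the dataset n times and return each epoch's element order."""
    return [[int(v) for v in ds.as_numpy_iterator()] for _ in range(n)]

base = tf.data.Dataset.range(100)

# shuffle -> cache: the first shuffled order is stored and then replayed.
frozen = epochs(base.shuffle(100, seed=1).cache())

# cache -> shuffle: the cached elements are reshuffled on every iteration.
fresh = epochs(base.cache().shuffle(100, seed=1))

print(frozen[0] == frozen[1])  # True: both epochs share one order
print(fresh[0] == fresh[1])    # expected False: a new order per epoch
```

Placing `shuffle()` after `cache()` therefore keeps the benefit of caching while still presenting the data in a different order each epoch.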

In [None]:
# 🚀 4.1 Performance Optimization Examples
print("🚀 PERFORMANCE OPTIMIZATION")
print("=" * 50)

# Create synthetic dataset for performance testing
def create_synthetic_data(n_samples=1000):
    """Create synthetic dataset for performance testing"""
    data = tf.random.normal((n_samples, 100))  # 100 features
    labels = tf.random.uniform((n_samples,), maxval=2, dtype=tf.int32)
    return tf.data.Dataset.from_tensor_slices((data, labels))

# Expensive preprocessing simulation
def expensive_preprocessing(features, label):
    """Simulate expensive preprocessing operation"""
    # Simulate computational cost
    processed_features = tf.nn.l2_normalize(features, axis=0)
    processed_features = tf.math.sin(processed_features) * tf.math.cos(processed_features)
    return processed_features, label

print("📊 Created synthetic dataset (1000 samples, 100 features)")

print("\n❌ BAD PIPELINE (Unoptimized)")
print("-" * 35)

# Bad pipeline - no optimization
bad_pipeline = (create_synthetic_data()
                .map(expensive_preprocessing)
                .batch(32))

print("✅ Bad pipeline structure:")
print("   data → map(expensive) → batch")
print("   Issues: No caching, no prefetching, no parallelization")

print("\n✅ GOOD PIPELINE (Optimized)")
print("-" * 32)

# Good pipeline - fully optimized
good_pipeline = (create_synthetic_data()
                 .map(expensive_preprocessing, 
                      num_parallel_calls=tf.data.AUTOTUNE)  # Parallel processing
                 .cache()                                   # Cache processed data
                 .shuffle(buffer_size=1000)                 # Shuffle
                 .batch(32)                                 # Batch
                 .prefetch(tf.data.AUTOTUNE))              # Prefetch

print("✅ Good pipeline structure:")
print("   data → map(expensive, parallel) → cache → shuffle → batch → prefetch")

print("\n⚡ PERFORMANCE COMPARISON")
print("-" * 28)

def time_dataset(dataset, name, num_epochs=3):
    """Time several full passes; cache() only pays off from the second epoch on"""
    print(f"\n🕐 Timing {name}:")

    start_time = time.time()
    for epoch in range(num_epochs):
        # Consume the whole dataset so the cache is completely filled
        for batch in dataset:
            pass
        print(f"   Finished epoch {epoch + 1}")

    elapsed = time.time() - start_time
    print(f"   ⏱️ Time: {elapsed:.2f}s ({elapsed/num_epochs:.3f}s per epoch)")
    return elapsed

# Time both pipelines over the same number of full epochs.
# On a dataset this small the difference can be tiny or even negative.
bad_time = time_dataset(bad_pipeline, "Bad Pipeline")
good_time = time_dataset(good_pipeline, "Good Pipeline")

improvement = (bad_time - good_time) / bad_time * 100
print(f"\n🎯 PERFORMANCE IMPROVEMENT: {improvement:.1f}%")

print("\n💡 OPTIMIZATION TECHNIQUES BREAKDOWN")
print("-" * 38)

print("1️⃣ AUTOTUNE - Automatic optimization")
print("   tf.data.AUTOTUNE lets the runtime tune values such as the")
print("   prefetch buffer_size and num_parallel_calls dynamically")

print("\n2️⃣ PREFETCHING - Overlap computation")
optimized_for_prefetch = (tf.data.Dataset.range(100)
                         .map(lambda x: tf.cast(x, tf.float32))
                         .batch(10)
                         .prefetch(tf.data.AUTOTUNE))

print("   ✅ Prefetch added - data loading overlaps with training")

print("\n3️⃣ CACHING - Avoid recomputation")
cached_dataset = (tf.data.Dataset.range(100)
                 .map(lambda x: x ** 2)  # Expensive operation
                 .cache()                # Cache results
                 .batch(10))

print("   ✅ Cache added - expensive operations computed once")

print("\n4️⃣ PARALLEL MAP - Use multiple cores")
parallel_dataset = (tf.data.Dataset.range(100)
                   .map(lambda x: tf.math.sin(tf.cast(x, tf.float32)),
                        num_parallel_calls=tf.data.AUTOTUNE)
                   .batch(10))

print("   ✅ Parallel processing - utilizes multiple CPU cores")

print("\n🔧 MEMORY OPTIMIZATION TIPS")
print("-" * 28)
print("💾 For large datasets:")
print("   - Use cache() only if data fits in memory")
print("   - Consider disk caching: cache('/path/to/cache')")
print("   - Use prefetch() to overlap I/O with computation")
print("   - Batch after expensive operations")

print("\n✅ Performance optimization complete!")
print("=" * 50)

---

# 💾 5. TFRecord Format

## 🎯 Why TFRecord?

**TFRecord** is TensorFlow's native binary format. Its advantages:

### ✅ Advantages:
- **🚀 Performance** - Usually faster to read than CSV, because binary records need no text parsing (the gain depends on the data and hardware)
- **📦 Compression** - Built-in compression (GZIP, ZLIB)
- **🔄 Efficiency** - Well suited to streaming large datasets
- **🏗️ Flexibility** - Supports complex data (nested, variable-length)
- **⚡ Integration** - Works directly with tf.data pipelines

### 📋 TFRecord Structure:
```
TFRecord File
├── Example 1 (Protocol Buffer)
│   ├── Feature 1 (bytes/float/int64)
│   ├── Feature 2 (bytes/float/int64)
│   └── ...
├── Example 2
└── ...
```

---

## 5.1 Creating & Reading TFRecord 🛠️

A TFRecord file stores a sequence of binary records; the usual practice is to fill it with serialized **Protocol Buffers** (`tf.train.Example`), which keeps serialization efficient.
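
The built-in compression mentioned earlier can be sketched as follows. The file name `demo.tfrecord.gz` and the single `value` feature are assumptions for illustration; the same compression type has to be given when writing and when reading.

```python
import os
import tempfile
import tensorflow as tf

# Hypothetical output path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo.tfrecord.gz")

# Write three serialized Example protos with GZIP compression.
options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter(path, options) as writer:
    for value in [1, 2, 3]:
        example = tf.train.Example(features=tf.train.Features(feature={
            "value": tf.train.Feature(int64_list=tf.train.Int64List(value=[value])),
        }))
        writer.write(example.SerializeToString())

# The reader must be told which compression was used.
schema = {"value": tf.io.FixedLenFeature([], tf.int64)}
dataset = tf.data.TFRecordDataset(path, compression_type="GZIP").map(
    lambda proto: tf.io.parse_single_example(proto, schema)
)
values = [int(record["value"]) for record in dataset.as_numpy_iterator()]
print(values)  # [1, 2, 3]
```

Compression trades CPU time for smaller files, so it tends to help most when the data is read over a network or from slow storage.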

In [None]:
# 💾 5.1 TFRecord Complete Example
print("💾 TFRECORD FORMAT")
print("=" * 50)

# Helper functions for TFRecord creation
def _bytes_feature(value):
    """Returns a bytes_list from a string / byte."""
    if isinstance(value, type(tf.constant(0))):
        value = value.numpy()
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
    """Returns a float_list from a float / double."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
    """Returns an int64_list from a bool / enum / int / uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def _float_list_feature(values):
    """Returns a float_list from a list of floats."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=values))

print("🔧 Helper functions created for TFRecord serialization")

print("\n1️⃣ CREATING TFRECORD FILE")
print("-" * 28)

# Sample structured data
samples = [
    {
        'id': 1,
        'features': [1.0, 2.0, 3.0, 4.0],
        'label': 'positive',
        'score': 0.95
    },
    {
        'id': 2,
        'features': [5.0, 6.0, 7.0, 8.0],
        'label': 'negative', 
        'score': 0.12
    },
    {
        'id': 3,
        'features': [9.0, 10.0, 11.0, 12.0],
        'label': 'positive',
        'score': 0.87
    }
]

# Create TFRecord file
tfrecord_filename = 'sample_data.tfrecord'

with tf.io.TFRecordWriter(tfrecord_filename) as writer:
    for sample in samples:
        # Create features dictionary
        feature_dict = {
            'id': _int64_feature(sample['id']),
            'features': _float_list_feature(sample['features']),
            'label': _bytes_feature(sample['label'].encode('utf-8')),
            'score': _float_feature(sample['score'])
        }
        
        # Create example
        example = tf.train.Example(
            features=tf.train.Features(feature=feature_dict)
        )
        
        # Write to file
        writer.write(example.SerializeToString())

print(f"✅ Created TFRecord: {tfrecord_filename}")
print(f"   📊 Contains {len(samples)} examples")
print(f"   💾 File size: {os.path.getsize(tfrecord_filename)} bytes")

print("\n2️⃣ READING TFRECORD FILE")
print("-" * 25)

# Define feature parsing schema
feature_description = {
    'id': tf.io.FixedLenFeature([], tf.int64),
    'features': tf.io.VarLenFeature(tf.float32),
    'label': tf.io.FixedLenFeature([], tf.string),
    'score': tf.io.FixedLenFeature([], tf.float32)
}

def parse_tfrecord(example_proto):
    """Parse TFRecord example"""
    parsed = tf.io.parse_single_example(example_proto, feature_description)
    
    # Convert sparse tensor to dense
    parsed['features'] = tf.sparse.to_dense(parsed['features'])
    
    return parsed

# Read TFRecord dataset
tfrecord_dataset = tf.data.TFRecordDataset(tfrecord_filename)
parsed_dataset = tfrecord_dataset.map(parse_tfrecord)

print("✅ TFRecord dataset loaded and parsed")
print("📋 Sample data:")
for i, record in enumerate(parsed_dataset):
    print(f"\n   Example {i+1}:")
    print(f"     ID: {record['id'].numpy()}")
    print(f"     Features: {record['features'].numpy()}")
    print(f"     Label: {record['label'].numpy().decode('utf-8')}")
    print(f"     Score: {record['score'].numpy():.3f}")

print("\n3️⃣ TRAINING-READY PIPELINE")
print("-" * 30)

def prepare_for_training(record):
    """Prepare parsed record for training"""
    features = record['features']
    # Convert string label to integer
    label = tf.cond(
        tf.equal(record['label'], b'positive'),
        lambda: tf.constant(1, dtype=tf.int32),
        lambda: tf.constant(0, dtype=tf.int32)
    )
    return features, label

# Create training pipeline
training_pipeline = (parsed_dataset
                    .map(prepare_for_training)
                    .shuffle(buffer_size=100)
                    .batch(2)
                    .prefetch(tf.data.AUTOTUNE))

print("✅ Training pipeline created:")
print("   TFRecord → parse → prepare → shuffle → batch → prefetch")

print("\n📊 Training batches:")
for i, (batch_features, batch_labels) in enumerate(training_pipeline):
    print(f"   Batch {i+1}:")
    print(f"     Features shape: {batch_features.shape}")
    print(f"     Labels: {batch_labels.numpy()}")
    print(f"     Features: {batch_features.numpy()}")

print("\n4️⃣ COMPRESSION BENEFITS")
print("-" * 25)

# Create compressed TFRecord
compressed_filename = 'sample_data_compressed.tfrecord'
options = tf.io.TFRecordOptions(compression_type="GZIP")

with tf.io.TFRecordWriter(compressed_filename, options=options) as writer:
    for sample in samples:
        feature_dict = {
            'id': _int64_feature(sample['id']),
            'features': _float_list_feature(sample['features']),
            'label': _bytes_feature(sample['label'].encode('utf-8')),
            'score': _float_feature(sample['score'])
        }
        
        example = tf.train.Example(
            features=tf.train.Features(feature=feature_dict)
        )
        writer.write(example.SerializeToString())

# Compare file sizes. With only three tiny records, GZIP header overhead can
# cancel out the savings, so the reduction may be small or even negative;
# compression pays off on larger files.
original_size = os.path.getsize(tfrecord_filename)
compressed_size = os.path.getsize(compressed_filename)
compression_ratio = (1 - compressed_size / original_size) * 100

print("✅ Compression results:")
print(f"   📄 Original: {original_size} bytes")
print(f"   📦 Compressed: {compressed_size} bytes")
print(f"   💾 Compression: {compression_ratio:.1f}% reduction")

# Cleanup
os.unlink(tfrecord_filename)
os.unlink(compressed_filename)
print("\n🧹 Files cleaned up")

print("\n✅ TFRecord examples complete!")
print("=" * 50)
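
The cell above writes a GZIP-compressed file but never reads it back. As a minimal sketch, assuming a hypothetical single-feature schema named `x` and a temporary file, reading works the same way as for an uncompressed file except that the reader must also be given `compression_type="GZIP"`:

```python
import os
import tempfile

import tensorflow as tf

def make_example(value):
    """Wrap one float in a tf.train.Example (hypothetical single-feature schema)."""
    feature = {"x": tf.train.Feature(float_list=tf.train.FloatList(value=[value]))}
    return tf.train.Example(features=tf.train.Features(feature=feature))

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo_gzip.tfrecord")

    # Write with GZIP compression.
    options = tf.io.TFRecordOptions(compression_type="GZIP")
    with tf.io.TFRecordWriter(path, options=options) as writer:
        for value in [1.0, 2.0, 3.0]:
            writer.write(make_example(value).SerializeToString())

    # Read back: the reader needs the same compression setting.
    schema = {"x": tf.io.FixedLenFeature([], tf.float32)}
    dataset = tf.data.TFRecordDataset(path, compression_type="GZIP").map(
        lambda record: tf.io.parse_single_example(record, schema)["x"]
    )
    values = [float(v) for v in dataset.as_numpy_iterator()]

print(values)
```

Because every record here has the same shape, `FixedLenFeature` is enough and no sparse-to-dense conversion is needed.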

---

# 🤝 6. Integration with Keras

## 🔗 Seamless Integration

A `tf.data.Dataset` can be passed directly to Keras:

```python
# Training dataset
train_dataset = (tf.data.Dataset.from_tensor_slices((X_train, y_train))
                .shuffle(1000)
                .batch(32)
                .prefetch(tf.data.AUTOTUNE))

# Train model
model.fit(train_dataset, epochs=10)
```

### 🎯 Benefits:
- ✅ No manual batching code
- ✅ Prefetching overlaps input preparation with training once `prefetch()` is in the pipeline
- ✅ Memory-efficient for large datasets
- ✅ Reproducible when random seeds are set
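
The reproducibility point can be checked directly. The sketch below, which uses a toy `range` dataset and a hypothetical helper `first_epoch`, builds the same seeded shuffle twice and compares the first-epoch order:

```python
import tensorflow as tf

def first_epoch(seed):
    """Return the first-epoch order of a small shuffled dataset."""
    dataset = tf.data.Dataset.range(10).shuffle(buffer_size=10, seed=seed)
    return [int(x) for x in dataset.as_numpy_iterator()]

run_a = first_epoch(seed=42)
run_b = first_epoch(seed=42)

print(run_a)
print(run_b)
print("Same order:", run_a == run_b)
```

Later epochs of the same dataset are still reshuffled by default (`reshuffle_each_iteration=True`); the seed only makes that sequence of orders repeatable between runs.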

In [None]:
# 🤝 6.1 Keras Integration Example
print("🤝 KERAS INTEGRATION")
print("=" * 50)

# Create synthetic dataset for demo
print("📊 Creating synthetic classification dataset...")
n_samples, n_features, n_classes = 1000, 20, 3

# Generate synthetic data
X_data = tf.random.normal((n_samples, n_features))
y_data = tf.random.uniform((n_samples,), maxval=n_classes, dtype=tf.int32)

print(f"✅ Dataset created: {n_samples} samples, {n_features} features, {n_classes} classes")

# Split data (80/20)
split_idx = int(0.8 * n_samples)
X_train, X_test = X_data[:split_idx], X_data[split_idx:]
y_train, y_test = y_data[:split_idx], y_data[split_idx:]

print(f"📋 Train: {len(X_train)} samples")
print(f"📋 Test:  {len(X_test)} samples")

print("\n1️⃣ CREATE OPTIMIZED DATASETS")
print("-" * 32)

# Training dataset with full pipeline
train_dataset = (tf.data.Dataset.from_tensor_slices((X_train, y_train))
                .shuffle(buffer_size=1000, seed=42)
                .batch(32)
                .prefetch(tf.data.AUTOTUNE))

# Validation dataset (no shuffle needed)
val_dataset = (tf.data.Dataset.from_tensor_slices((X_test, y_test))
              .batch(32)
              .prefetch(tf.data.AUTOTUNE))

print("✅ Datasets created:")
print("   🔄 Train: shuffle → batch → prefetch")
print("   📊 Val:   batch → prefetch")

print("\n2️⃣ CREATE SIMPLE MODEL")
print("-" * 25)

# Simple neural network
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(n_features,)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(n_classes, activation='softmax')
])

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

print("✅ Model created:")
print(f"   📊 Architecture: {n_features} → 64 → 32 → {n_classes}")
print("   🎯 Task: Multi-class classification")

print("\n3️⃣ TRAIN WITH tf.data")
print("-" * 22)

# Train model with tf.data datasets
print("🚀 Training model...")
history = model.fit(
    train_dataset,
    validation_data=val_dataset,
    epochs=3,
    verbose=1
)

print("\n✅ Training complete!")

print("\n4️⃣ EVALUATE PERFORMANCE")
print("-" * 25)

# Evaluate on the held-out split (the same split passed as validation_data above)
test_loss, test_accuracy = model.evaluate(val_dataset, verbose=0)
print("📊 Test Results:")
print(f"   Loss: {test_loss:.4f}")
print(f"   Accuracy: {test_accuracy:.4f}")

print("\n5️⃣ BATCH PREDICTION")
print("-" * 20)

# Make predictions on a batch
sample_batch = next(iter(val_dataset))
batch_features, batch_labels = sample_batch

predictions = model.predict(batch_features, verbose=0)
# argmax returns int64 by default; request int32 so it matches batch_labels in tf.equal below
predicted_classes = tf.argmax(predictions, axis=1, output_type=tf.int32)

print("🔮 Batch predictions:")
print(f"   Batch size: {len(batch_labels)}")
print(f"   True labels: {batch_labels.numpy()[:5]}")
print(f"   Predictions: {predicted_classes.numpy()[:5]}")

# Calculate batch accuracy
batch_accuracy = tf.reduce_mean(
    tf.cast(tf.equal(predicted_classes, batch_labels), tf.float32)
)
print(f"   Batch accuracy: {batch_accuracy.numpy():.4f}")

print("\n💡 KEY BENEFITS OF tf.data + Keras:")
print("-" * 35)
print("✅ Automatic batching and prefetching")
print("✅ Memory efficient for large datasets")
print("✅ No need to load entire dataset in memory")
print("✅ Seamless integration with model.fit()")
print("✅ Support for validation_data parameter")
print("✅ Built-in support for steps_per_epoch")

print("\n✅ Keras integration complete!")
print("=" * 50)

---

# 💡 7. Best Practices & Tips

## 🎯 Performance Best Practices

### 1. **Pipeline Order** 🔄
```python
# ✅ RECOMMENDED ORDER:
dataset = (tf.data.Dataset.from_tensor_slices(data)
           .map(preprocess_fn)            # 1. Apply transformations
           .cache()                       # 2. Cache processed data
           .shuffle(buffer_size)          # 3. Shuffle after cache so every epoch gets a new order
           .batch(batch_size)             # 4. Batch data
           .prefetch(tf.data.AUTOTUNE))   # 5. Prefetch last
```

### 2. **Memory Management** 💾
- Use `.cache()` only when the data fits in memory
- Use `.cache('/path/to/disk')` to cache to disk instead
- For very large datasets, avoid a full-size shuffle buffer; use a bounded `buffer_size` and shuffle the input files instead
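
As a rough illustration of disk caching, the sketch below counts how often a stand-in "expensive" function runs across two epochs. The function `expensive`, the `calls` counter, and the temporary directory are all hypothetical; `tf.py_function` is used only so the Python-side counter can observe each call.

```python
import os
import tempfile

import tensorflow as tf

calls = []  # one entry per real execution of the "expensive" function

def expensive(x):
    calls.append(1)
    return x * 2

with tempfile.TemporaryDirectory() as tmpdir:
    cache_prefix = os.path.join(tmpdir, "cache")
    dataset = (
        tf.data.Dataset.range(5)
        .map(lambda x: tf.py_function(expensive, [x], tf.int64))
        .cache(cache_prefix)  # file-based cache instead of RAM
    )
    first_epoch = [int(v) for v in dataset.as_numpy_iterator()]   # fills the cache
    second_epoch = [int(v) for v in dataset.as_numpy_iterator()]  # served from disk
    cache_files = sorted(os.listdir(tmpdir))

print("expensive() calls:", len(calls))
print("epochs:", first_epoch, second_epoch)
print("cache files:", cache_files)
```

The cache is only finalized after one complete pass over the dataset, so a pipeline that stops early (for example with `take()` after `cache()`) will not benefit from it.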

### 3. **Parallelization** ⚡
- Pass `num_parallel_calls=tf.data.AUTOTUNE` to `map()`
- Let TensorFlow tune the level of parallelism automatically
- Monitor CPU utilization
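
A minimal sketch of a parallel `map`, assuming a toy `preprocess` function that stands in for a costly transformation. By default the output order is still deterministic even though elements are processed in parallel.

```python
import tensorflow as tf

def preprocess(x):
    """Stand-in for a costly per-element transformation."""
    return tf.cast(x, tf.float32) ** 2

dataset = (
    tf.data.Dataset.range(8)
    .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)  # parallel map
    .batch(4)
    .prefetch(tf.data.AUTOTUNE)
)

batches = [batch.tolist() for batch in dataset.as_numpy_iterator()]
print(batches)
```

If element order does not matter, passing `deterministic=False` to `map()` may allow extra throughput at the cost of run-to-run ordering.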

---

## 🚨 Common Pitfalls

### ❌ Don't Do This:
```python
# BAD: Shuffle after batch (only whole batches are reordered, not samples)
dataset.batch(32).shuffle(1000)

# BAD: Cache before expensive operations (expensive_fn is recomputed every epoch)
dataset.cache().map(expensive_fn)

# BAD: Cache after shuffle (the first epoch's order is frozen and replayed)
dataset.shuffle(1000).cache()

# BAD: No prefetching
dataset.batch(32)  # Missing prefetch()
```

### ✅ Do This Instead:
```python
# GOOD: Proper order
dataset.map(expensive_fn).cache().shuffle(1000).batch(32).prefetch(tf.data.AUTOTUNE)
```
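
The relative position of `cache()` and `shuffle()` decides whether each epoch sees a new order. The sketch below compares both placements on a toy `range` dataset; the helper `epochs` is hypothetical.

```python
import tensorflow as tf

def epochs(dataset, n=2):
    """Iterate the dataset n times and return each epoch as a list of ints."""
    return [[int(x) for x in dataset.as_numpy_iterator()] for _ in range(n)]

source = tf.data.Dataset.range(100)

# shuffle -> cache: the order produced in the first epoch is stored and replayed.
frozen = epochs(source.shuffle(100).cache())

# cache -> shuffle: cached elements are shuffled again on every epoch.
fresh = epochs(source.cache().shuffle(100))

print("shuffle -> cache, identical epochs:", frozen[0] == frozen[1])
print("cache -> shuffle, identical epochs:", fresh[0] == fresh[1])
```

Both variants still yield every element exactly once per epoch; only the variation between epochs differs.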

---

## 🔧 Debugging Tips

### 1. **Inspect Dataset**
```python
# Check first few samples
for sample in dataset.take(3):
    print(sample)

# Check shapes
print(dataset.element_spec)
```

### 2. **Performance Profiling**
```python
import time

start = time.time()
for batch in dataset.take(100):
    pass
print(f"Time: {time.time() - start:.2f}s")
```
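
A slightly more careful variant of the timing loop is sketched below. It assumes a hypothetical helper `time_epoch` that runs a warm-up pass first (so one-off costs such as cache filling are excluded) and uses `time.perf_counter()`, which is better suited to measuring intervals than `time.time()`.

```python
import time

import tensorflow as tf

def time_epoch(dataset, warmup=1):
    """Return (seconds, batches) for one full pass, after optional warm-up passes."""
    for _ in range(warmup):
        for _ in dataset:
            pass
    start = time.perf_counter()
    count = 0
    for _ in dataset:
        count += 1
    return time.perf_counter() - start, count

dataset = tf.data.Dataset.range(1000).batch(100).prefetch(tf.data.AUTOTUNE)
seconds, n_batches = time_epoch(dataset)
print(f"{n_batches} batches in {seconds:.4f}s")
```

Timing two pipeline variants with the same helper gives a quick, if noisy, comparison; for a detailed breakdown the TensorFlow Profiler is the more reliable tool.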

---

# 🎉 Chapter Summary

## 📚 What We Covered

In Chapter 13, we covered:

### 🔰 1. Data API Fundamentals
- ✅ The tf.data.Dataset concept
- ✅ Creating datasets from various sources
- ✅ Basic transformations (map, filter, batch, shuffle)

### 📁 2. Data Loading
- ✅ Loading from CSV files
- ✅ TextLineDataset for custom parsing
- ✅ Binary and structured data formats

### 🔧 3. Preprocessing
- ✅ Numerical preprocessing (normalization, standardization)
- ✅ Text preprocessing (tokenization, padding, vectorization)
- ✅ Feature engineering techniques

### 🚀 4. Performance Optimization
- ✅ Prefetching and caching
- ✅ Parallel processing
- ✅ Optimal pipeline ordering
- ✅ Memory management

### 💾 5. TFRecord Format
- ✅ Creating and reading TFRecord files
- ✅ Protocol Buffers serialization
- ✅ Compression benefits

### 🤝 6. Keras Integration
- ✅ Seamless integration with model.fit()
- ✅ Training and validation pipelines
- ✅ Batch prediction workflow

---

## 🎯 Key Takeaways

### 💡 **The Golden Pipeline:**
```python
optimal_pipeline = (
    tf.data.Dataset.from_tensor_slices(data)
    .map(preprocess_fn, num_parallel_calls=tf.data.AUTOTUNE)
    .cache()
    .shuffle(buffer_size)
    .batch(batch_size)
    .prefetch(tf.data.AUTOTUNE)
)
```

### 🚀 **Performance Mantra:**
> "Map → Cache → Shuffle → Batch → Prefetch"

### 💾 **Format Choice:**
- 📊 **CSV**: prototyping and small datasets
- 💾 **TFRecord**: production and large datasets
- 🔄 **tf.data**: always, for the training pipeline

---

## 🔮 Next Steps

After mastering Chapter 13, you are ready for:

- 🧠 **Chapter 14**: Convolutional Neural Networks
- 🎯 **Chapter 15**: Processing Sequences using RNNs
- 🚀 **Advanced Topics**: Custom training loops, distributed training
- 💼 **Real Projects**: Apply tf.data to real-world datasets

---

## 🏆 Congratulations!

🎉 **Congratulations!** You have mastered the TensorFlow Data API, a foundational skill for scalable and efficient deep learning!

**Remember**:
> *"Good data pipelines are the backbone of successful deep learning projects"*

---

### 📖 Resources for Further Learning

- 📚 [TensorFlow Data Guide](https://www.tensorflow.org/guide/data)
- 🎥 [tf.data Best Practices](https://www.tensorflow.org/guide/data_performance)
- 💻 [TensorFlow Datasets](https://www.tensorflow.org/datasets)
- 🔬 [Advanced tf.data Techniques](https://www.tensorflow.org/guide/data_performance)

**Happy Learning! 🚀**

## 6. Integration with Keras 🤝

### 6.1 Datasets for Training

A `tf.data.Dataset` can be passed directly to:
- `model.fit()` for training
- `model.evaluate()` for evaluation
- `model.predict()` for prediction

### 6.2 Preprocessing Layers

Keras also provides preprocessing layers that can be built into the model itself:
- `tf.keras.layers.Normalization`
- `tf.keras.layers.StringLookup`
- `tf.keras.layers.TextVectorization`
- `tf.keras.layers.CategoryEncoding`
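
As a small sketch of two of these layers working together, the example below turns string labels into integer ids with `StringLookup` and then one-hot encodes them with `CategoryEncoding`. The vocabulary and the sample labels are made up for illustration.

```python
import tensorflow as tf

labels = tf.constant(["positive", "negative", "unknown"])

# Map strings to integer ids; with the default settings, index 0 is
# reserved for out-of-vocabulary values such as "unknown".
lookup = tf.keras.layers.StringLookup(vocabulary=["negative", "positive"])
ids = lookup(labels)

# One-hot encode the ids; num_tokens must also cover the OOV slot.
encoder = tf.keras.layers.CategoryEncoding(
    num_tokens=lookup.vocabulary_size(), output_mode="one_hot"
)
one_hot = encoder(ids)

print(ids.numpy())
print(one_hot.numpy())
```

Placing such layers inside the model means the exported model accepts raw strings directly; applying them in `dataset.map()` instead keeps that work on the input pipeline.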

In [None]:
# 🤝 Keras Integration Examples
print("=== KERAS INTEGRATION ===")

# 1. Create sample dataset
print("=== PREPARING DATASET FOR KERAS ===")
# Generate sample data
np.random.seed(42)
X_data = np.random.randn(1000, 4)  # 1000 samples, 4 features
y_data = (X_data[:, 0] + X_data[:, 1] > 0).astype(int)  # Binary classification

print(f"Data shape: {X_data.shape}")
print(f"Labels shape: {y_data.shape}")
print(f"Class distribution: {np.bincount(y_data)}")

# Create tf.data.Dataset
dataset = tf.data.Dataset.from_tensor_slices((X_data, y_data))

# Split into train/validation BEFORE shuffling. Shuffling first and then
# calling take()/skip() reshuffles on every epoch, so validation samples
# would leak into the training set.
batch_size = 32
train_size = int(0.8 * len(X_data))

train_dataset = (dataset
                 .take(train_size)
                 .shuffle(buffer_size=train_size)
                 .batch(batch_size)
                 .prefetch(tf.data.AUTOTUNE))

val_dataset = (dataset
               .skip(train_size)
               .batch(batch_size)
               .prefetch(tf.data.AUTOTUNE))

print(f"Training batches: {train_dataset.cardinality().numpy()}")
print(f"Validation batches: {val_dataset.cardinality().numpy()}")

# 2. Create and train Keras model
print("\n=== KERAS MODEL WITH tf.data ===")

# Define model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(4,)),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

print("Model architecture:")
model.summary()

# Train with tf.data.Dataset
print("\n=== TRAINING WITH tf.data ===")
history = model.fit(
    train_dataset,
    validation_data=val_dataset,
    epochs=3,
    verbose=1
)

print("✅ Training completed!")
print(f"Final training accuracy: {history.history['accuracy'][-1]:.4f}")
print(f"Final validation accuracy: {history.history['val_accuracy'][-1]:.4f}")

# 3. Demonstrate preprocessing layers
print("\n=== PREPROCESSING LAYERS ===")

# Example with text data
text_data = [
    "I love machine learning",
    "TensorFlow is great", 
    "Deep learning rocks",
    "AI is the future"
]

# Create text vectorization layer
text_vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=1000,
    output_sequence_length=10
)

# Adapt to text data
text_vectorizer.adapt(text_data)

print("Text vectorization example:")
print("Original texts:", text_data[:2])
vectorized = text_vectorizer(text_data[:2])
print("Vectorized:", vectorized.numpy())

# Example with normalization layer
print("\n=== NORMALIZATION LAYER ===")
normalizer = tf.keras.layers.Normalization()

# Adapt to data
sample_data = np.random.randn(100, 3) * 10 + 5  # Mean≈5, Std≈10
normalizer.adapt(sample_data)

print("Original data stats:")
print(f"  Mean: {np.mean(sample_data, axis=0)}")
print(f"  Std: {np.std(sample_data, axis=0)}")

# Normalize the full array: statistics over only 5 rows would not be close to 0 and 1
normalized = normalizer(sample_data)
print("Normalized data (first 5 samples):")
print(normalized.numpy()[:5])
print(f"Normalized mean: {np.mean(normalized.numpy(), axis=0)}")
print(f"Normalized std: {np.std(normalized.numpy(), axis=0)}")

print("\n✅ Keras integration examples completed!")

## 7. Best Practices & Summary 📋

### 7.1 Data Pipeline Best Practices

1. **Optimal Transformation Order**:
   ```python
   dataset = (tf.data.Dataset.from_generator(...)
              .map(parse_fn, num_parallel_calls=tf.data.AUTOTUNE)
              .cache()  # Cache after expensive operations
              .shuffle(buffer_size)
              .batch(batch_size)
              .prefetch(tf.data.AUTOTUNE))
   ```

2. **Performance Tips**:
   - Pass `num_parallel_calls=tf.data.AUTOTUNE` to `map()` operations
   - Put `prefetch()` at the end of the pipeline
   - Cache the dataset after expensive operations and before shuffle
   - Use TFRecord for large datasets
   - Batch before a transformation when that transformation can be vectorized to work on whole batches

3. **Memory Management**:
   - Avoid `.cache()` for datasets too large to fit in memory
   - Use a generator for data that does not fit in memory
   - Consider `.cache(filename)` to cache to disk

### 7.2 Common Patterns

- **Image Data**: `map(decode_image) → cache() → shuffle() → batch() → prefetch()`
- **Text Data**: `map(tokenize) → padded_batch() → prefetch()`
- **Structured Data**: `map(normalize) → batch() → prefetch()`
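
The text pattern relies on `padded_batch()`, which ordinary `batch()` cannot replace when sequences differ in length. A minimal sketch, assuming some made-up pre-tokenized sentences:

```python
import tensorflow as tf

# Hypothetical pre-tokenized sentences of different lengths.
token_ids = [[1, 2, 3], [4], [5, 6]]

dataset = tf.data.Dataset.from_generator(
    lambda: iter(token_ids),
    output_signature=tf.TensorSpec(shape=(None,), dtype=tf.int32),
)

# padded_batch pads every batch to its longest sequence (with 0 by default).
padded = dataset.padded_batch(2).prefetch(tf.data.AUTOTUNE)

batches = [batch.tolist() for batch in padded.as_numpy_iterator()]
print(batches)
```

Each batch is padded only to the longest sequence it contains, so batches can have different widths unless `padded_shapes` is set explicitly.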

In [None]:
# 📋 Chapter 13 Summary & Best Practices
print("=== CHAPTER 13 SUMMARY ===")
print("🎯 Loading and Preprocessing Data with TensorFlow")
print()

print("📚 KEY CONCEPTS LEARNED:")
concepts = [
    "1. tf.data.Dataset fundamentals and creation methods",
    "2. Dataset transformations (map, filter, batch, shuffle)",
    "3. Loading data from various sources (CSV, TFRecord, etc.)",
    "4. Data preprocessing and feature engineering",
    "5. Text preprocessing and vectorization", 
    "6. Performance optimization (prefetch, cache, parallel)",
    "7. TFRecord format for efficient storage",
    "8. Integration with Keras models and preprocessing layers"
]

for concept in concepts:
    print(f"   ✅ {concept}")

print("\n🚀 PERFORMANCE OPTIMIZATION CHECKLIST:")
optimizations = [
    "Use num_parallel_calls=AUTOTUNE for map operations",
    "Implement prefetch(AUTOTUNE) at end of pipeline",
    "Cache expensive operations with cache()",
    "Shuffle with appropriate buffer_size",
    "Use TFRecord for large datasets",
    "Batch data for efficient processing",
    "Consider preprocessing layers in model"
]

for opt in optimizations:
    print(f"   🔧 {opt}")

print("\n📊 COMMON DATA PIPELINE PATTERN:")
print("""
   dataset = (tf.data.TFRecordDataset(filenames)
              .map(preprocessing_fn, num_parallel_calls=tf.data.AUTOTUNE)
              .cache()
              .shuffle(buffer_size=1000)
              .batch(batch_size)
              .prefetch(tf.data.AUTOTUNE))
""")

print("🎉 NEXT STEPS:")
next_steps = [
    "Practice with real-world datasets",
    "Experiment with different preprocessing techniques",
    "Benchmark pipeline performance",
    "Explore advanced tf.data features",
    "Integrate with complex model architectures"
]

for step in next_steps:
    print(f"   📈 {step}")

print("\n" + "="*50)
print("🎊 CONGRATULATIONS! You've completed Chapter 13!")
print("   You now understand TensorFlow Data API and")
print("   can build efficient data pipelines for deep learning!")
print("="*50)