# Chapter 13: Loading and Preprocessing Data with TensorFlow

This chapter shows how to **load and preprocess data efficiently** with `tf.data`, TensorFlow's API for building input pipelines.

---

## 🎯 Chapter Objectives
- Build data pipelines from arrays, text files, images, or TFRecord files
- Apply the core transformations: `map`, `batch`, `shuffle`, `repeat`
- Assemble more complex pipelines for large and varied datasets
- Read, normalize, and preprocess data efficiently
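
The transformations listed above can also be chained on (feature, label) pairs. The following is a minimal sketch, assuming small hypothetical NumPy arrays (`features`, `labels`) that are not part of the chapter's own code:

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy data: 6 samples with 2 features each, plus integer labels.
features = np.arange(12, dtype=np.float32).reshape(6, 2)
labels = np.arange(6, dtype=np.int64)

# Slicing a tuple yields one (feature_row, label) pair per dataset element.
pairs = tf.data.Dataset.from_tensor_slices((features, labels))

pipeline = (
    pairs
    .map(lambda x, y: (x / 10.0, y))  # element-wise transformation of the features
    .shuffle(buffer_size=6, seed=42)  # buffer covers the whole dataset: full shuffle
    .batch(4)                         # 6 elements -> batches of 4 and 2
    .repeat(2)                        # two passes over the data
)

batch_sizes = [int(y.shape[0]) for _, y in pipeline]
print(batch_sizes)  # → [4, 2, 4, 2]
```

Because `repeat` comes after `batch`, each pass ends with its own partial batch instead of mixing elements from two passes into one batch.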

---

## 🧩 Topics Covered

1. **`from_tensor_slices()`** – Creating a dataset from a simple tensor
2. **Data Transformations** – `map`, `shuffle`, `batch`, `repeat`
3. **TextLineDataset** – Reading a text file line by line
4. **Interleave** – Reading lines from several files in turn (in parallel only when `num_parallel_calls` is set)
5. **Image Preprocessing** – Reading and resizing PNG images
6. **TFRecord** – An efficient binary record format for storing data
7. **Custom Normalization** – Computing the mean and standard deviation of each batch, then normalizing that batch with them
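
Per-batch normalization (topic 7) uses different statistics for every batch. A common alternative, shown here only as a hedged sketch and not as the chapter's method, is to compute the mean and standard deviation once over the whole dataset and reuse them for every batch. The toy data and all variable names below are assumptions made for illustration:

```python
import tensorflow as tf

# Hypothetical toy data: 1000 samples, 3 features, deliberately shifted and scaled.
data = tf.random.normal([1000, 3], mean=5.0, stddev=2.0, seed=0)
raw_ds = tf.data.Dataset.from_tensor_slices(data).batch(32)

# Pass 1: accumulate sums to obtain the dataset-wide mean and standard deviation.
count = tf.constant(0.0)
total = tf.zeros([3])
total_sq = tf.zeros([3])
for batch in raw_ds:
    count += tf.cast(tf.shape(batch)[0], tf.float32)
    total += tf.reduce_sum(batch, axis=0)
    total_sq += tf.reduce_sum(tf.square(batch), axis=0)

mean = total / count
std = tf.sqrt(total_sq / count - tf.square(mean))

# Pass 2: apply the same fixed statistics to every batch.
eps = 1e-7  # guards against division by zero for constant features
norm_ds = raw_ds.map(lambda b: (b - mean) / (std + eps))

normalized = tf.concat(list(norm_ds), axis=0)
print("mean per feature:", tf.reduce_mean(normalized, axis=0).numpy())
print("std per feature:", tf.math.reduce_std(normalized, axis=0).numpy())
```

With fixed statistics, the same input value is always mapped to the same normalized value, regardless of which batch it lands in.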

---

## 💡 Conclusion

With `tf.data`, we can build data pipelines that:
- Are efficient and scalable
- Support parallelism and prefetching
- Work well for text, images, and specialized formats such as TFRecord

A well-built pipeline keeps training fast and stable!
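
The parallelism and prefetching mentioned above can be sketched as follows. This is a minimal illustration, assuming a hypothetical `preprocess` function that stands in for a costlier step such as image decoding:

```python
import tensorflow as tf

def preprocess(x):
    # Hypothetical stand-in for an expensive step (decoding, augmentation, ...).
    return tf.cast(x, tf.float32) / 255.0

dataset = (
    tf.data.Dataset.range(1000)
    # Run preprocess on several elements concurrently; order is preserved by default.
    .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32)
    # Prepare upcoming batches in the background while the current one is consumed.
    .prefetch(tf.data.AUTOTUNE)
)

# 1000 elements in batches of 32 give 32 batches (the last one holds 8 elements).
num_batches = sum(1 for _ in dataset)
first_batch = next(iter(dataset))
print(num_batches, first_batch.shape)
```

`tf.data.AUTOTUNE` lets TensorFlow choose the degree of parallelism and the prefetch buffer size at runtime, which is usually a reasonable default before tuning by hand.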


In [3]:
# CHAPTER 13: Loading and Preprocessing Data with TensorFlow

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import os
from tensorflow.keras.preprocessing.image import array_to_img

print("TensorFlow version:", tf.__version__)

# =============================================================================
# 🧪 1. Basic Dataset (from_tensor_slices)
# =============================================================================

# from_tensor_slices yields one dataset element per entry along the first axis.
X = tf.range(10)
dataset = tf.data.Dataset.from_tensor_slices(X)
print("# Basic dataset (from_tensor_slices):")
for item in dataset:
    print(item.numpy(), end=" ")
print("\n")

# =============================================================================
# 🔄 2. Dataset Transformations (map, shuffle, batch, repeat)
# =============================================================================

dataset = dataset.map(lambda x: x * 2)    # double every element
dataset = dataset.shuffle(buffer_size=5)  # buffer smaller than the dataset: partial shuffle
dataset = dataset.batch(3)                # 10 elements -> batches of 3, 3, 3, 1
dataset = dataset.repeat(2)               # two epochs, each reshuffled

print("# Dataset after map, shuffle, batch, repeat:")
for batch in dataset:
    print(batch.numpy())
print("\n")

# =============================================================================
# 📄 3. Reading a Text File
# =============================================================================

# Write a small sample file so the example is self-contained.
file_path = "sample_lines.txt"
with open(file_path, "w") as f:
    f.write("line 1\nline 2\nline 3\nline 4\n")

# TextLineDataset yields each line as a byte string, without the trailing newline.
text_ds = tf.data.TextLineDataset(file_path)
print("# Reading the file line by line:")
for line in text_ds:
    print(line.numpy().decode("utf-8"))
print("\n")

# =============================================================================
# 🔁 4. Interleaving Files
# =============================================================================

# Create three small files to read from.
for i in range(3):
    with open(f"file_{i}.txt", "w") as f:
        f.write(f"File {i} - Line 1\n")
        f.write(f"File {i} - Line 2\n")

# list_files shuffles the file names by default, so the file order varies between runs.
file_list_ds = tf.data.Dataset.list_files("file_*.txt")
# With cycle_length=3, interleave cycles through three files, taking one line from each in turn.
dataset = file_list_ds.interleave(
    lambda fname: tf.data.TextLineDataset(fname),
    cycle_length=3)

print("# Interleaving several files:")
for line in dataset:
    print(line.numpy().decode())
print("\n")

# =============================================================================
# 🖼️ 5. Image Pipeline (Revised)
# =============================================================================

img_dir = "images"
os.makedirs(img_dir, exist_ok=True)

# Save RGB images so that loading them does not raise an error
for i in range(5):
    img = np.random.rand(28, 28, 3) * 255  # RGB
    array_to_img(img.astype(np.uint8)).save(f"{img_dir}/img_{i}.png")

print("Files in image folder:", os.listdir(img_dir))

# Load and preprocess
img_ds = tf.data.Dataset.list_files(f"{img_dir}/*.png")

def load_and_preprocess(path):
    img = tf.io.read_file(path)                 # raw PNG bytes
    img = tf.image.decode_png(img, channels=3)  # uint8 tensor of shape (H, W, 3)
    img = tf.image.resize(img, [28, 28])        # returns float32 with the default method
    img = tf.cast(img, tf.float32) / 255.0      # scale pixel values to [0, 1]
    return img

img_ds = img_ds.map(load_and_preprocess).batch(2)

print("# Batch of preprocessed images:")
for batch in img_ds.take(1):
    print("Shape:", batch.shape)
print("\n")

# =============================================================================
# 📦 6. TFRecord Format
# =============================================================================

# Write five tf.train.Example records, each holding a single int64 feature.
tfrecord_file = "data.tfrecord"
with tf.io.TFRecordWriter(tfrecord_file) as writer:
    for i in range(5):
        feature = {
            "feature": tf.train.Feature(int64_list=tf.train.Int64List(value=[i]))
        }
        example = tf.train.Example(features=tf.train.Features(feature=feature))
        writer.write(example.SerializeToString())

def parse_example(serialized):
    # The description must match the names and types used when writing.
    feature_description = {
        "feature": tf.io.FixedLenFeature([], tf.int64)
    }
    return tf.io.parse_single_example(serialized, feature_description)

tfrecord_ds = tf.data.TFRecordDataset([tfrecord_file])
tfrecord_ds = tfrecord_ds.map(parse_example)

print("# Reading back from TFRecord:")
for record in tfrecord_ds:
    print(record["feature"].numpy())
print("\n")

# =============================================================================
# ⚙️ 7. Per-Batch Normalization (Custom Preprocessing)
# =============================================================================

raw_data = tf.data.Dataset.from_tensor_slices(tf.random.normal([1000, 3]))
raw_data = raw_data.batch(32)

def normalize_batch(batch):
    # Statistics are computed per feature (axis=0) from the current batch only.
    mean = tf.reduce_mean(batch, axis=0)
    std = tf.math.reduce_std(batch, axis=0)
    # The small epsilon guards against division by zero for constant features.
    return (batch - mean) / (std + 1e-7)

norm_ds = raw_data.map(normalize_batch)

print("# Example of a normalized batch:")
for batch in norm_ds.take(1):
    print("Normalized batch shape:", batch.shape)


TensorFlow version: 2.18.0
# Basic dataset (from_tensor_slices):
0 1 2 3 4 5 6 7 8 9 

# Dataset after map, shuffle, batch, repeat:
[6 8 2]
[ 4 16 10]
[ 0 12 18]
[14]
[ 0  8 10]
[12  4  6]
[14 16  2]
[18]


# Reading the file line by line:
line 1
line 2
line 3
line 4


# Interleaving several files:
File 2 - Line 1
File 1 - Line 1
File 0 - Line 1
File 2 - Line 2
File 1 - Line 2
File 0 - Line 2


Files in image folder: ['img_1.png', 'img_2.png', 'img_0.png', 'img_4.png', 'img_3.png']
# Batch of preprocessed images:
Shape: (2, 28, 28, 3)


# Reading back from TFRecord:
0
1
2
3
4


# Example of a normalized batch:
Normalized batch shape: (32, 3)
