# Chapter 15: TFX - MLOps and Model Deployment with TensorFlow

## Summary

This chapter covers **TFX (TensorFlow Extended)**, a comprehensive framework for implementing MLOps (Machine Learning Operations) and productionizing machine learning models. TFX provides an end-to-end pipeline that automates the whole workflow, from data ingestion, feature transformation, model training, and evaluation through to deployment in a production environment. The chapter covers the definitions of MLOps versus productionization, the TFX components (CsvExampleGen, StatisticsGen, SchemaGen, Transform, Trainer), feature engineering with tensorflow_transform, anomaly detection with tensorflow_data_validation, model training with the Keras API, SignatureDefs for serving, containerization with Docker, and deployment via TensorFlow Serving to expose the model as a REST API.

---

## Part 1: Writing a Data Pipeline with TFX

### MLOps and Productionization Concepts

**MLOps (Machine Learning Operations)**: A workflow that automates most of the steps from data collection to delivery of a trained model, with minimal human intervention. MLOps combines ML and DevOps practices to increase the velocity of model development and operation.

**Productionization**: The process of deploying a trained model (on a private server or in the cloud) so that customers can use it robustly for its intended purpose. This includes designing a scalable API that can handle thousands of requests per second.

**Analogy**: MLOps is the **journey**; productionization is the **destination** (the final deployment of the model).

**Why MLOps Matters**:
- At large companies (Google, Facebook, Amazon), hundreds or thousands of models produce predictions every second
- Models must not go stale; they need continuous training or fine-tuning on newly arriving data
- An MLOps pipeline can ingest data, train models, evaluate them automatically, and push them to production if they pass the validation checks
- Validation checks are an important safeguard against rogue, underperforming models

**TFX (TensorFlow Extended)**: A library that provides all the components needed to implement a machine learning pipeline that ingests data, transforms it into features, trains a model, and pushes it to a production environment.

### Case Study: Predicting Forest Fire Severity

**Dataset**: Historical forest fires in Montesinho park, Portugal (available from the UCI ML Repository). CSV format with 13 columns (12 input features plus the target):
- **X, Y**: Spatial coordinates within the park map
- **month, day**: Month of the year and day of the week
- **FFMC (Fine Fuel Moisture Code)**: Fuel moisture of the forest litter
- **DMC (Duff Moisture Code)**: Numerical rating of the average moisture content of the soil
- **DC (Drought Code)**: Depth of dryness in the soil
- **ISI (Initial Spread Index)**: Expected rate of fire spread
- **temp**: Temperature (Celsius)
- **RH**: Relative humidity (%)
- **wind**: Wind speed (km/h)
- **rain**: Outside rain (mm/m²)
- **area**: Burnt area of the forest (hectares) - **Target variable**

**Task**: A regression problem: predict the burnt area given all the other features.

### Environment Setup and Data Download

**Environment Requirements**:
- **Operating System**: Linux (Ubuntu) is highly recommended; TFX is not tested on Windows
- **TFX Version**: 1.6.0 (later versions have a bug in the interactive environment)
- **Docker**: For containerizing model serving
- **Anaconda Environment**:
  ```bash
  conda create -n manning.tf2.tfx python=3.6
  conda activate manning.tf2.tfx
  pip install --use-deprecated=legacy-resolver -r requirements.txt
  ```

**Download Data**:
```python
import requests
import os

url = "http://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv"
r = requests.get(url)
os.makedirs(os.path.join('data', 'csv'), exist_ok=True)
with open(os.path.join('data', 'csv', 'forestfires.csv'), 'wb') as f:
    f.write(r.content)
```

**Data Splitting**:
- 95% for training/validation
- 5% for testing (a dedicated test set that the model never sees)
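The notes do not show how the 95/5 split is produced, so the following is only a minimal sketch of one way to do it. The helper name `split_rows`, the seed, and the in-memory rows are hypothetical stand-ins for the real `forestfires.csv` contents.

```python
import random

def split_rows(rows, test_fraction=0.05, seed=42):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Tiny in-memory stand-in for the CSV rows (hypothetical values)
rows = [[i % 9 + 1, i % 7 + 2, float(i)] for i in range(100)]
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))
```

In practice the two lists would be written to separate CSV files, with the training portion placed under `data/csv/train`, the directory that CsvExampleGen reads from.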

### Component 1: CsvExampleGen

**Purpose**: Reads data from CSV files, splits it into train/eval, and converts it to TFRecord format.

**Implementation**:
```python
from tfx.components import CsvExampleGen

example_gen = CsvExampleGen(input_base=os.path.join('data', 'csv', 'train'))
context.run(example_gen)
```

**InteractiveContext**: The context used to run the various TFX steps, manage state between steps, and maintain the metadata store (a SQLite database).

**Metadata Store**: A database that logs information about inputs, outputs, and execution-related outputs (component run identifiers, errors). It is immensely helpful for debugging complex TFX pipelines.

**CsvExampleGen Output**:
- **Directory Structure**:
  ```
  pipeline/examples/forest_fires_pipeline/CsvExampleGen/examples/1/
    ├── Split-train/  (TFRecord files .gz)
    └── Split-eval/   (TFRecord files .gz)
  ```
- **Split Method**: Hashing-based splitting. TFX uses hash buckets (3 by default: 2 for train, 1 for eval) to assign examples to train/eval
- **Hash Buckets**: TFX generates a hash from the values in a record, and the hash value determines the bucket. Example: hash 7 → bucket = 7 % 3 = 1
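The bucket assignment described above can be sketched in a few lines. This is an illustration only: `assign_split` is a hypothetical helper, MD5 is a stand-in for whatever fingerprint function TFX uses internally, and the assumption that the first two buckets map to train and the last to eval is made for the sake of the example.

```python
import hashlib

def assign_split(record, buckets=(("train", 2), ("eval", 1))):
    """Map a record to a split by hashing it into one of sum(n) buckets."""
    total = sum(n for _, n in buckets)
    # Stable hash of the record contents (MD5 is an illustrative stand-in)
    h = int(hashlib.md5(record.encode("utf-8")).hexdigest(), 16)
    bucket = h % total
    upper = 0
    for name, n in buckets:
        upper += n
        if bucket < upper:
            return name
    raise AssertionError("unreachable: bucket is always < total")

records = [f"row-{i}" for i in range(3000)]
splits = [assign_split(r) for r in records]
print(splits.count("train") / len(splits))  # close to 2/3
```

Because the split depends only on the record contents, the same record always lands in the same split, which keeps the train/eval partition stable across pipeline runs.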

**TFRecord Format**: Data is stored as byte streams. The format is efficient for TensorFlow pipelines and can be read back as a tf.data.Dataset.

**Inspecting TFRecord Data**:
```python
train_uri = os.path.join(example_gen.outputs['examples'].get()[0].uri, 'Split-train')
tfrecord_filenames = [os.path.join(train_uri, name) for name in os.listdir(train_uri)]
dataset = tf.data.TFRecordDataset(tfrecord_filenames, compression_type="GZIP")

for tfrecord in dataset.take(2):
    serialized_example = tfrecord.numpy()
    example = tf.train.Example()
    example.ParseFromString(serialized_example)
    print(example)
```

**Data Structure in a TFRecord**: Each record is a tf.train.Example holding a collection of features; every feature has a key (the column name) and a value (a float_list, int64_list, or bytes_list).

### Component 2: StatisticsGen

**Purpose**: Generates basic statistics and visualizations for Exploratory Data Analysis (EDA).

**Implementation**:
```python
from tfx.components import StatisticsGen

statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])
context.run(statistics_gen)
context.show(statistics_gen.outputs['statistics'])
```

**Output Visualizations**: A rich collection of graphs for understanding data quality and distribution.

**Key Information Displayed**:
1. **Numerical Features Section**:
   - Count, missing percentage
   - Mean, standard deviation
   - Min, max, median
   - Histogram of the distribution (which can be heavily skewed; for example, FFMC is concentrated in the 80-90 range)

2. **Categorical Features Section**:
   - Number of unique values
   - Missing percentage
   - Mode (most frequent value) and its count
   - Bar graph for features with few unique values
   - Line graph for features with many unique values (less cluttered)

**Dashboard Controls**:
- **Search bar**: Filter features by name
- **Data type filter**: Show only numerical or only categorical features
- **Chart type**: Standard histogram or quantile-based
- **Sort order**: Various sorting options

### Component 3: SchemaGen

**Purpose**: Automatically derives a schema from the data. A schema is a blueprint for the data that expresses its structure and important attributes.

**Database Schema Analogy**: Just as a database schema defines table structure, a TFX schema defines the data structure and its validation rules.

**Implementation**:
```python
from tfx.components import SchemaGen

schema_gen = SchemaGen(
    statistics=statistics_gen.outputs['statistics'],
    infer_feature_shape=False  # Important for the downstream Transform step
)
context.run(schema_gen)
context.show(schema_gen.outputs['schema'])
```

**infer_feature_shape Argument**:
- **False**: Tensors are passed to the Transform step as tf.SparseTensor (more flexibility for feature manipulation)
- **True**: Tensors are passed as tf.Tensor with a known shape

**Schema Components**:
1. **Feature name**: Column identifier
2. **Type**: INT, FLOAT, STRING, BOOLEAN
3. **Presence**: required, optional
4. **Valency**: single, repeated
5. **Domain**: Constraints on the feature's values

**Domain Types** (defined in schema.proto):
- **Integer domain**: Defines min/max for integer features
- **Float domain**: Defines min/max for float features
- **String domain**: Defines the allowed values/tokens for string features
- **Boolean domain**: Custom values for the true/false states
- **Struct domain**: Recursive domains or domains with multiple features
- **Natural language domain**: Defines a vocabulary for language features
- **Image domain**: Restricts the maximum byte size of images
- **Time domain**: Defines date/time features
- **Time of day domain**: Defines a time without a date

**Protobuf Library**: A library for object serialization/deserialization developed by Google. The schema is defined in .proto files, and deserialization goes through functions such as ParseFromString().

### Component 4: Transform

**Purpose**: Converts raw data columns into model-ready features through various transformations.

**Feature Types Created**:
1. **Dense floating-point features**: Values passed through as-is with optional normalization (for example, temperature, wind speed)
2. **Bucketized features**: Numerical values binned into predefined intervals (for example, RH bucketized into low/medium/high)
3. **Categorical features**: Values from a predefined set, converted to integer indices using a vocabulary (for example, day, month)

**Feature Assignment for the Dataset**:
- **Dense float features**: X, Y, wind, rain, FFMC, DMC, DC, ISI, temp (Z-score normalization)
- **Bucketized features**: RH (buckets: [-inf, 33), [33, 66), [66, inf])
- **Categorical features**: month (12 categories), day (7 categories)
- **Label feature**: area (kept numerical for regression)
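To make the RH bucket intervals concrete, here is a small pure-Python sketch of the same lookup. It mirrors the intervals listed above rather than calling tensorflow_transform; `bucket_index` is a hypothetical helper name.

```python
import bisect

BOUNDARIES = [33, 66]  # same boundaries as used for RH

def bucket_index(value, boundaries=BOUNDARIES):
    """Return 0 for value < 33, 1 for 33 <= value < 66, 2 for value >= 66."""
    return bisect.bisect_right(boundaries, value)

for rh in (20, 33, 50, 66, 90):
    print(rh, "->", bucket_index(rh))
```

Two boundaries produce three buckets, which is why the model code later uses `len(boundaries) + 1` as the number of buckets.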

**Constants Definition** (forest_fires_constants.py):
```python
VOCAB_FEATURE_KEYS = ['day', 'month']
MAX_CATEGORICAL_FEATURE_VALUES = [7, 12]
DENSE_FLOAT_FEATURE_KEYS = ['DC', 'DMC', 'FFMC', 'ISI', 'rain', 'temp', 'wind', 'X', 'Y']
BUCKET_FEATURE_KEYS = ['RH']
BUCKET_FEATURE_BOUNDARIES = [(33, 66)]
LABEL_KEY = 'area'

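def transformed_name(key):
    return key + '_xf'  # Suffix that distinguishes transformed features

def transformed_names(keys):
    # Plural helper used by the model code: apply the suffix to a list of keys
    return [transformed_name(key) for key in keys]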
```

**Preprocessing Function** (forest_fires_transform.py):
```python
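import tensorflow as tf
import tensorflow_transform as tft

# Constants and helper defined in forest_fires_constants.py
from forest_fires_constants import (
    BUCKET_FEATURE_BOUNDARIES, BUCKET_FEATURE_KEYS, DENSE_FLOAT_FEATURE_KEYS,
    LABEL_KEY, VOCAB_FEATURE_KEYS, transformed_name,
)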

def preprocessing_fn(inputs):
    outputs = {}
    
    # Dense features: Z-score normalization
    for key in DENSE_FLOAT_FEATURE_KEYS:
        outputs[transformed_name(key)] = tft.scale_to_z_score(
            sparse_to_dense(inputs[key])
        )
    
    # Vocabulary-based categorical: Build vocab, convert to integer ID
    for key in VOCAB_FEATURE_KEYS:
        outputs[transformed_name(key)] = tft.compute_and_apply_vocabulary(
            sparse_to_dense(inputs[key]),
            num_oov_buckets=1  # Assign unseen strings to a special category
        )
    
    # Bucketized features: Apply bucket boundaries
    for key, boundary in zip(BUCKET_FEATURE_KEYS, BUCKET_FEATURE_BOUNDARIES):
        outputs[transformed_name(key)] = tft.apply_buckets(
            sparse_to_dense(inputs[key]),
            bucket_boundaries=[boundary]
        )
    
    # Label: Keep as-is
    outputs[transformed_name(LABEL_KEY)] = sparse_to_dense(inputs[LABEL_KEY])
    
    return outputs

def sparse_to_dense(x):
    return tf.squeeze(
        tf.sparse.to_dense(
            tf.SparseTensor(x.indices, x.values, [x.dense_shape[0], 1])
        ),
        axis=1
    )
```
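The vocabulary step in `preprocessing_fn` can be pictured with plain Python. This is a conceptual sketch, not the tensorflow_transform implementation: `build_vocab` and `apply_vocab` are hypothetical helpers, and the alphabetical tie-break is an assumption made only to keep the example deterministic.

```python
from collections import Counter

def build_vocab(values):
    """Order distinct values by descending frequency (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: index for index, value in enumerate(ordered)}

def apply_vocab(values, vocab):
    """Map values to integer ids; unseen values share one out-of-vocabulary id."""
    oov_id = len(vocab)  # single OOV bucket placed after the known ids
    return [vocab.get(v, oov_id) for v in values]

days = ["fri", "sun", "fri", "sat", "sun", "fri"]
vocab = build_vocab(days)
print(vocab)
print(apply_vocab(["sat", "mon"], vocab))
```

The vocabulary is learned from the training data once, so a value such as `"mon"` that never appeared there falls into the extra out-of-vocabulary bucket at serving time.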

**tensorflow_transform Library**: A sub-library focused on feature transformations. It provides functions for:
- Bucketizing features
- Building bag-of-words from a string column
- Computing covariance matrices
- Computing the mean, standard deviation, min, max, and count of columns

**Z-score Normalization Formula**:
```
z = (x - μ(x)) / σ(x)
```
Here μ(x) is the mean of the column and σ(x) is its standard deviation.
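A quick numeric check of the formula, written with the standard library only. The sketch assumes the population standard deviation; the column values are made up for illustration.

```python
import statistics

def z_scores(column):
    """Apply z = (x - mean) / std to every value in the column."""
    mu = statistics.mean(column)
    sigma = statistics.pstdev(column)  # population std (an assumption)
    return [(x - mu) / sigma for x in column]

temps = [10.0, 20.0, 30.0]  # hypothetical temperature column
scaled = z_scores(temps)
print(scaled)
```

After scaling, the column has mean 0 and standard deviation 1, which puts features with very different ranges (such as DC and rain) on a comparable scale.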

**Transform Component Implementation**:
```python
from tfx.components import Transform

transform = Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    module_file=os.path.abspath('forest_fires_transform.py')
)
context.run(transform)
```

**Inspecting Transform Output**:
- Transformed features are stored as TFRecord files, with the `_xf` suffix on each feature name
- Float features: Normalized values (for example, DC_xf = 0.4196)
- Categorical features: Integer indices (for example, day_xf = 2)
- Bucketized features: Bucket indices (for example, RH_xf = 0)

**Rule of Thumb**: Check the pipeline's interim outputs whenever possible as a sanity check. TFX offers little visibility into its internals and is not yet highly mature, so probe each component to verify its inputs and outputs.

---

## Part 2: Training a Regression Neural Network with the TFX Trainer API

### Keras Model Definition with Feature Columns

**tf.feature_column**: A standard feature representation accepted by TensorFlow models. It is handy for defining data in a column-oriented fashion (each feature is a column) and is well suited to structured data.

**Feature Column Types**:
1. **tf.feature_column.numeric_column**: Dense floating-point fields (for example, temperature)
2. **tf.feature_column.categorical_column_with_identity**: Categorical/bucketized fields holding an integer index (for example, day, month)
3. **tf.feature_column.indicator_column**: Converts a categorical_column_with_identity to a one-hot encoded representation
4. **tf.feature_column.embedding_column**: Generates an embedding from an integer-based column

**Building Feature Columns**:
```python
def _build_keras_model():
    # Dense float features
    real_valued_columns = [
        tf.feature_column.numeric_column(key=key, shape=(1,))
        for key in transformed_names(DENSE_FLOAT_FEATURE_KEYS)
    ]
    
    # Bucketized features (one-hot encoded)
    categorical_columns = [
        tf.feature_column.indicator_column(
            tf.feature_column.categorical_column_with_identity(
                key,
                num_buckets=len(boundaries)+1
            )
        )
        for key, boundaries in zip(
            transformed_names(BUCKET_FEATURE_KEYS),
            BUCKET_FEATURE_BOUNDARIES
        )
    ]
    
    # Vocabulary-based categorical features (one-hot encoded)
    categorical_columns += [
        tf.feature_column.indicator_column(
            tf.feature_column.categorical_column_with_identity(
                key,
                num_buckets=num_buckets,
                default_value=num_buckets-1  # Unseen categories assigned here
            )
        )
        for key, num_buckets in zip(
            transformed_names(VOCAB_FEATURE_KEYS),
            MAX_CATEGORICAL_FEATURE_VALUES
        )
    ]
    
    # Build model
    model = _dnn_regressor(
        columns=real_valued_columns + categorical_columns,
        dnn_hidden_units=[128, 64]
    )
    return model
```

**Toy Example - numeric_column**:
```python
a = tf.feature_column.numeric_column("a")
x = tf.keras.layers.DenseFeatures(a)({'a': [0.5, 0.6]})
print(x)  # [[0.5], [0.6]], shape=(2, 1)
```

**Toy Example - categorical with indicator**:
```python
b = tf.feature_column.indicator_column(
    tf.feature_column.categorical_column_with_identity('b', num_buckets=10)
)
y = tf.keras.layers.DenseFeatures(b)({'b': [5, 2]})
print(y)  # [[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.], [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]]
```

### Deep Neural Network Regressor

**Architecture**:
```python
def _dnn_regressor(columns, dnn_hidden_units):
    # Input layers: dictionary mapping feature name → Input layer
    input_layers = {
        colname: tf.keras.layers.Input(name=colname, shape=(), dtype=tf.float32)
        for colname in transformed_names(DENSE_FLOAT_FEATURE_KEYS)
    }
    input_layers.update({
        colname: tf.keras.layers.Input(name=colname, shape=(), dtype='int32')
        for colname in transformed_names(VOCAB_FEATURE_KEYS)
    })
    input_layers.update({
        colname: tf.keras.layers.Input(name=colname, shape=(), dtype='int32')
        for colname in transformed_names(BUCKET_FEATURE_KEYS)
    })
    
    # DenseFeatures layer: aggregate all Input layers → single tensor
    output = tf.keras.layers.DenseFeatures(columns)(input_layers)
    
    # Hidden layers
    for numnodes in dnn_hidden_units:
        output = tf.keras.layers.Dense(numnodes, activation='tanh')(output)
    
    # Regression output layer
    output = tf.keras.layers.Dense(1)(output)  # Linear activation
    
    # Compile model
    model = tf.keras.Model(input_layers, output)
    model.compile(
        loss='mean_squared_error',
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)
    )
    
    return model
```

**DenseFeatures Layer Functionality**:
- **Input**: A dictionary of Input layers plus a list of feature columns
- **Process**: Maps each Input layer to its corresponding feature column
- **Output**: A single tensor (for example, shape [batch_size, 31] for 31 total feature dimensions)

**Model Summary**:
- One Input layer per feature (12 features in total)
- DenseFeatures aggregates them into a [None, 31] tensor (one-hot encoding expands the dimensions)
- Hidden layer 1: 128 nodes (tanh activation)
- Hidden layer 2: 64 nodes (tanh activation)
- Output layer: 1 node (linear activation for regression)
- Total parameters: 12,417
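The figures in the summary can be reproduced by hand. The sketch below counts the 31 input dimensions (9 dense floats, 3 RH buckets, 7 day categories, 12 month categories) and then the weights and biases of each Dense layer; `dense_params` is a hypothetical helper.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

input_width = 9 + 3 + 7 + 12           # dense + RH buckets + day + month
widths = [input_width, 128, 64, 1]     # input, two hidden layers, output
per_layer = [dense_params(a, b) for a, b in zip(widths, widths[1:])]
print(input_width, per_layer, sum(per_layer))
```

The Input and DenseFeatures layers contribute no trainable parameters here because the categorical columns are one-hot encoded rather than embedded.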

### Data Input Function

**Purpose**: Generates tf.data.Dataset objects from TFRecord files for training and evaluation.

**Type Hinting in Python**: A visual cue that documents the expected input/output types (not enforced by the interpreter).
- Syntax: `def function(argument: type) -> return_type:`
- Complex types: Use the `typing` library (for example, `List[Text]` is a list of strings)
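A minimal illustration of the syntax, including the point that hints are not enforced at runtime. The function `longest_name` is a made-up example, not part of the chapter's code.

```python
from typing import List, Text

def longest_name(names: List[Text]) -> Text:
    """Return the longest string in a non-empty collection of names."""
    return max(names, key=len)

print(longest_name(["temp", "wind", "FFMC_xf"]))
# Hints are not enforced at runtime: a tuple is accepted without error
print(longest_name(("RH", "rain")))
```

Static checkers and IDEs use the hints to flag mismatches before the code runs, which is where their practical value lies.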

**Implementation**:
```python
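from typing import List, Text

import tensorflow as tf
import tensorflow_transform as tft
from tfx import v1 as tfx
from tfx_bsl.public import tfxio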

def _input_fn(file_pattern: List[Text],
              data_accessor: tfx.components.DataAccessor,
              tf_transform_output: tft.TFTransformOutput,
              batch_size: int = 200) -> tf.data.Dataset:
    
    return data_accessor.tf_dataset_factory(
        file_pattern,
        tfxio.TensorFlowDatasetOptions(
            batch_size=batch_size,
            label_key=transformed_name(LABEL_KEY)
        ),
        tf_transform_output.transformed_metadata.schema
    )
```

**Arguments**:
- **file_pattern**: List of file paths containing the data
- **data_accessor**: A special TFX object that creates a tf.data.Dataset from file names
- **tf_transform_output**: The transformation graph that converts raw examples to features
- **batch_size**: Integer batch size

**DataAccessor Functionality**: Takes file paths, dataset options (batch size, label key), and a schema, and returns a tf.data.Dataset with the features separated from the label.

### Model Training Function

**FnArgs Object**: A utility object in TFX that carries the training-related, user-defined arguments into `run_fn`.

**Key Attributes of FnArgs**:
- **train_files**: List of training file names
- **eval_files**: List of evaluation file names
- **train_steps**: Number of training steps
- **eval_steps**: Number of evaluation steps
- **schema_path**: Path to the schema generated by SchemaGen
- **transform_graph_path**: Path to the transform graph produced by Transform
- **serving_model_dir**: Output directory for the servable model
- **model_run_dir**: Output directory for model run artifacts
**run_fn Implementation** (forest_fires_trainer.py):
```python
def run_fn(fn_args: tfx.components.FnArgs):
    # Log fn_args for visibility
    absl.logging.info("=" * 50)
    absl.logging.info("Printing fn_args object")
    absl.logging.info(fn_args)
    absl.logging.info("=" * 50)
    
    # Load transform graph
    tf_transform_output = tft.TFTransformOutput(fn_args.transform_graph_path)
    
    # Create datasets
    train_dataset = _input_fn(
        fn_args.train_files,
        fn_args.data_accessor,
        tf_transform_output,
        batch_size=40
    )
    eval_dataset = _input_fn(
        fn_args.eval_files,
        fn_args.data_accessor,
        tf_transform_output,
        batch_size=40
    )
    
    # Build model
    model = _build_keras_model()
    
    # CSV logger callback
    csv_write_dir = os.path.join(fn_args.model_run_dir, 'model_performance')
    os.makedirs(csv_write_dir, exist_ok=True)
    csv_callback = tf.keras.callbacks.CSVLogger(
        os.path.join(csv_write_dir, 'performance.csv'),
        append=False
    )
    
    # Train model
    model.fit(
        train_dataset,
        steps_per_epoch=fn_args.train_steps,
        validation_data=eval_dataset,
        validation_steps=fn_args.eval_steps,
        epochs=10,
        callbacks=[csv_callback]
    )
    
    # Define signatures
    signatures = {
        'serving_default': _get_serve_tf_examples_fn(
            model, tf_transform_output
        ).get_concrete_function(
            tf.TensorSpec(shape=[None], dtype=tf.string, name='examples')
        )
    }
    
    # Save model
    model.save(
        fn_args.serving_model_dir,
        save_format='tf',
        signatures=signatures
    )
```

### SignatureDefs: Defining Model Behavior for API Requests

**Purpose**: Signatures define how the model behaves when data is sent to it through an API call after deployment. Just as a personal signature uniquely identifies a person, a TensorFlow signature uniquely determines how the model handles a given kind of HTTP request.

**Signature Components**:
- **Key**: A unique identifier that selects the signature (sent as `signature_name` in the request)
- **Value**: A TensorFlow function (decorated with @tf.function) that defines how the input is handled and passed to the model

**TensorFlow Signature Names** (defined constants):
1. **PREDICT_METHOD_NAME** (`'tensorflow/serving/predict'`): Predicts the target for incoming inputs (no target required)
2. **REGRESS_METHOD_NAME** (`'tensorflow/serving/regress'`): Regresses from `tf.Example` inputs and returns one regression value per example
3. **CLASSIFY_METHOD_NAME** (`'tensorflow/serving/classify'`): Classifies `tf.Example` inputs and returns classes and/or scores per example
4. **DEFAULT_SERVING_SIGNATURE_DEF_KEY** (`'serving_default'`): The default signature key (the minimum requirement)

**@tf.function Decorator**: Takes a function containing TensorFlow operations, traces its steps, and converts it to a data-flow graph. It is required for signature definitions.

**Serving Function Implementation**:
```python
def _get_serve_tf_examples_fn(model, tf_transform_output):
    # Attach the transform layer to the model
    model.tft_layer = tf_transform_output.transform_features_layer()
    
    @tf.function
    def serve_tf_examples_fn(serialized_tf_examples):
        # Get raw feature specs (column name → Feature type mapping)
        feature_spec = tf_transform_output.raw_feature_spec()
        feature_spec.pop(LABEL_KEY)  # Remove the label from the input
        
        # Parse serialized examples
        parsed_features = tf.io.parse_example(
            serialized_tf_examples,
            feature_spec
        )
        
        # Transform raw columns to features
        transformed_features = model.tft_layer(parsed_features)
        
        # Return model predictions
        return model(transformed_features)
    
    return serve_tf_examples_fn
```

**Feature Spec Structure**: A dictionary mapping column names to feature types (VarLenFeature with a dtype).
```python
{
    'DC': VarLenFeature(dtype=tf.float32),
    'RH': VarLenFeature(dtype=tf.int64),
    'day': VarLenFeature(dtype=tf.string),
    ...
}
```

**get_concrete_function()**: Traces the function and builds its data-flow graph for the given input signature, then returns the traced function without executing it.

**TransformFeaturesLayer**: A Keras layer that knows how to convert parsed examples into batched inputs with multiple features.

### Training with the TFX Trainer Component

**Calculate Training Steps**:
```python
n_dataset_size = df.shape[0]
batch_size = 40

# Training: 2/3 data (2 hash buckets)
n_train_steps = int(2 * n_dataset_size / (3 * batch_size))
n_train_steps_mod = 2 * n_dataset_size % (3 * batch_size)
if n_train_steps_mod != 0:
    n_train_steps += 1

# Evaluation: 1/3 data (1 hash bucket)
n_eval_steps = int(n_dataset_size / (3 * batch_size))
n_eval_steps_mod = n_dataset_size % (3 * batch_size)
if n_eval_steps_mod != 0:
    n_eval_steps += 1
```
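The divide-then-adjust logic above is a ceiling division, which can be written more compactly. The sketch below is an equivalent formulation; `ceil_div` and `steps` are hypothetical helpers, and the dataset size of 491 is an assumed value used only for the demonstration.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b without floating-point rounding."""
    return -(-a // b)

def steps(n_dataset_size, batch_size):
    """Steps per epoch for the 2/3 train and 1/3 eval hash-bucket split."""
    n_train = ceil_div(2 * n_dataset_size, 3 * batch_size)
    n_eval = ceil_div(n_dataset_size, 3 * batch_size)
    return n_train, n_eval

print(steps(491, 40))  # (9, 5)
```

Rounding up ensures the final partial batch is still consumed, so every example in each split is seen once per epoch.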

**Trainer Component**:
```python
from tfx.components import Trainer
from tfx.proto import trainer_pb2

trainer = Trainer(
    module_file=os.path.abspath("forest_fires_trainer.py"),
    transformed_examples=transform.outputs['transformed_examples'],
    schema=schema_gen.outputs['schema'],
    transform_graph=transform.outputs['transform_graph'],
    train_args=trainer_pb2.TrainArgs(num_steps=n_train_steps),
    eval_args=trainer_pb2.EvalArgs(num_steps=n_eval_steps)
)
context.run(trainer)
```

**Training Output Log**:
- Generates a wheel package (.whl) from the training code
- Prints the model summary (Input layers, DenseFeatures, hidden layers, output)
- Shows a warning about tf.function retracing (unavoidable given how the TFX Trainer behaves)
- Shows training progress with the loss per epoch
- Saves the final model to `<pipeline_root>/Trainer/model/<execution_id>/Format-Serving`

**Training Results**:
- Epoch 1: loss ~13,636, val_loss ~574
- Epoch 10: loss ~1,105, val_loss ~456
- Final MSE ~456 → root-mean-squared error of about 21 hectares (0.21 km²) per example
- The error is fairly large, which is attributed to anomalies in the data
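The step from the validation MSE to an error in hectares is a square root, since the loss is in squared hectares. A short check using the reported val_loss:

```python
import math

val_mse = 456.0                    # final validation loss (hectares squared)
rmse_ha = math.sqrt(val_mse)       # back to hectares
rmse_km2 = rmse_ha * 0.01          # 1 hectare = 0.01 square kilometres
print(round(rmse_ha, 2), round(rmse_km2, 3))
```

This gives roughly 21.4 hectares, which is the typical size of a prediction error on the evaluation split.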

### Anomaly Detection and Removal

**Problem**: A validation loss (MSE) of 456 means predictions are off by roughly 21 hectares. The data contains many outliers.

**tensorflow_data_validation (tfdv) Library**: Provides functions to validate data against a schema, display anomalies, and edit the schema.

**Key Functions**:
- **tfdv.validate_statistics()**: Validates data statistics against the schema
- **tfdv.display_anomalies()**: Displays the anomalies that were found
- **tfdv.get_feature()**: Retrieves a feature from the schema so that its outlier criteria can be edited
- **tfdv.visualize_statistics()**: Visualizes the original versus the cleaned data

**Schema Editing Example**:
```python
isi_feature = tfdv.get_feature(schema, 'ISI')
isi_feature.float_domain.max = 30.0  # Change maximum allowed value
```

**ExampleValidator Component**: The TFX component used to confirm that no anomalies remain in the dataset after cleaning.

---

## Part 3: Containerization with Docker

### Docker Concepts

**Container**: A lightweight, portable, isolated environment for running an application together with all its dependencies. Containers differ from Virtual Machines (VMs):

**Docker vs Virtual Machines**:
- **VM**: Includes a full OS, heavy (GBs), slow startup, resource-intensive
- **Container**: Shares the host OS kernel, lightweight (MBs), fast startup (typically seconds or less), efficient resource usage

**Docker Components**:
1. **Docker Image**: Blueprint for a container (a static snapshot of the application plus its dependencies)
2. **Docker Container**: A running instance of an image (an isolated runtime environment)
3. **Dockerfile**: A text file with the instructions for building a Docker image

**Analogy**: The image is the recipe; the container is the dish cooked from that recipe.

### Dockerfile for TensorFlow Serving

**Base Image**: `tensorflow/serving:2.6.0` (the official TensorFlow Serving image).

**Dockerfile Structure**:
```dockerfile
FROM tensorflow/serving:2.6.0

# Environment variables
ENV MODEL_BASE_PATH=/models
ENV MODEL_NAME=forest_fires_model

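# Copy the model into the container
COPY --from=<source> <model_path> ${MODEL_BASE_PATH}/${MODEL_NAME}/1

# Expose the port for the REST API
EXPOSE 8501

# Run the TensorFlow Serving server.
# The exec (JSON-array) form of CMD does not expand ${...} variables, so the
# shell form is used here. The base image is understood to define its own
# entrypoint script; it is reset so that CMD runs as the actual command.
ENTRYPOINT []
CMD tensorflow_model_server \
    --rest_api_port=8501 \
    --model_name=${MODEL_NAME} \
    --model_base_path=${MODEL_BASE_PATH}/${MODEL_NAME}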
```

**Key Instructions**:
- **FROM**: Specifies the base image
- **ENV**: Sets environment variables
- **COPY**: Copies files into the container
- **EXPOSE**: Declares a network port
- **CMD**: The command to run when the container starts

**Model Versioning**: Models are organized in directories named by version number (for example, `/models/forest_fires_model/1`). This allows multiple versions to be served simultaneously.

### Building Docker Image

**Build Command**:
```bash
docker build -t forest-fires-model-server:latest .
```

**Arguments**:
- **-t**: Tag name for the image
- **.**: Build context (the current directory)

**Build Process**:
1. Downloads the base image if it does not exist locally
2. Executes the Dockerfile instructions sequentially
3. Creates a layer for each instruction (cached for efficiency)
4. Tags the final image with the specified name

**List Images**:
```bash
docker images
```

### Running Docker Container

**Run Command**:
```bash
docker run -p 8501:8501 --name forest-fires-server forest-fires-model-server:latest
```

**Arguments**:
- **-p 8501:8501**: Port mapping (host:container)
- **--name**: Container name
- Last argument: Image name

**Port Mapping**: Maps a port on the host machine to a port in the container, so services inside the container can be reached from the host.

**Container Management Commands**:
```bash
docker ps                    # List running containers
docker ps -a                 # List all containers (including stopped)
docker stop <container_name> # Stop container
docker start <container_name># Start stopped container
docker rm <container_name>   # Remove container
docker logs <container_name> # View container logs
```

---

## Part 4: Deployment and Serving via REST API

### REST API Concepts

**REST (Representational State Transfer)**: An architectural style for designing networked applications. It uses HTTP requests to access and manipulate resources.

**HTTP Methods**:
- **GET**: Requests data from the server (no request body, can be cached)
- **POST**: Sends data to the server (has a request body, so the data is kept out of the URL, which suits sensitive data better)
- **PUT**: Updates an existing resource
- **DELETE**: Removes a resource

**HTTP Request Anatomy**:
1. **Method type**: GET, POST, PUT, DELETE
2. **Path**: URL of the endpoint
3. **Body**: Payload needed to complete the request
4. **Header**: Additional information (for example, content-type)

**HTTP Response Anatomy**:
1. **Status code**: Indicates whether the request succeeded or failed (200 = success, 404 = not found, 500 = server error)
2. **Header**: Metadata about the response
3. **Body**: Response data (for example, predictions)

### TensorFlow Serving API Endpoints

**Base URL Format**: `http://<hostname>:<port>/v1/models/<model_name>`

**Available Endpoints**:
1. **:predict**
   - **URL**: `/v1/models/forest_fires_model:predict`
   - **Method**: POST
   - **Purpose**: Predict outputs; no target is required
   - **Input**: serialized examples
   - **Output**: predictions

2. **:regress**
   - **URL**: `/v1/models/forest_fires_model:regress`
   - **Method**: POST
   - **Purpose**: Regression on `tf.Example` inputs (requires a signature that uses the regress method)
   - **Output**: one regression value per example

3. **:classify**
   - **URL**: `/v1/models/forest_fires_model:classify`
   - **Method**: POST
   - **Purpose**: Classification of `tf.Example` inputs (requires a signature that uses the classify method)
   - **Output**: class labels and scores per example

4. **/metadata**
   - **URL**: `/v1/models/forest_fires_model/metadata`
   - **Method**: GET
   - **Purpose**: Get metadata about the available endpoints/signatures
   - **Output**: model signature information
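The URL patterns listed above follow a simple rule: verbs are appended with a colon, while metadata is a sub-path. A small helper makes that explicit; `serving_url` and its defaults are hypothetical and only mirror the host, port, and model name used in this chapter.

```python
def serving_url(model_name, verb=None, host="localhost", port=8501):
    """Build a TensorFlow Serving REST URL for a model and optional verb."""
    base = f"http://{host}:{port}/v1/models/{model_name}"
    if verb is None:
        return base
    if verb == "metadata":
        return f"{base}/metadata"      # GET endpoint, sub-path form
    if verb in ("predict", "regress", "classify"):
        return f"{base}:{verb}"        # POST endpoints, colon form
    raise ValueError(f"unknown verb: {verb}")

print(serving_url("forest_fires_model", "predict"))
print(serving_url("forest_fires_model", "metadata"))
```

Querying the metadata URL first is a convenient way to confirm which signatures the deployed model actually exposes before sending prediction requests.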

### Making API Requests with Python

**Request Body Format**:
```python
{
    "signature_name": "serving_default",
    "instances": [<serialized_examples>]
}
```

**Base64 Encoding**: Encodes a byte stream (binary input) as an ASCII text string. It is needed to carry serialized examples inside a JSON HTTP request.

**Python Implementation**:
```python
import base64
import json
import requests

# Prepare request body
req_body = {
    "signature_name": "serving_default",
    "instances": [
        str(base64.b64encode(
            b'{"X": 7, "Y": 4, "month": "oct", "day": "fri", '
            b'"FFMC": 60, "DMC": 30, "DC": 200, "ISI": 9, '
            b'"temp": 30, "RH": 50, "wind": 10, "rain": 0}'
        ))
    ]
}

# Convert to JSON
data = json.dumps(req_body)

# Send POST request
json_response = requests.post(
    'http://localhost:8501/v1/models/forest_fires_model:predict',
    data=data,
    headers={"content-type": "application/json"}
)

# Parse response
predictions = json.loads(json_response.text)
print(predictions)  # {'predictions': [[2.77522683]]}
```
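One detail in the request above is worth checking when adapting it: calling `str()` on a `bytes` object yields its Python repr (with a leading `b'`), not the Base64 text itself. The TensorFlow Serving REST documentation describes binary values as objects of the form `{"b64": "<base64 text>"}`. The sketch below contrasts the two encodings; the payload bytes are a placeholder standing in for a serialized `tf.train.Example`, and whether the server in this chapter accepts the plain-string form has not been verified here.

```python
import base64
import json

# Placeholder bytes standing in for a serialized tf.train.Example (assumption)
payload = b"\x0a\x03abc"

as_repr = str(base64.b64encode(payload))             # repr, starts with b'
as_text = base64.b64encode(payload).decode("ascii")  # plain Base64 text

body = json.dumps({
    "signature_name": "serving_default",
    "instances": [{"b64": as_text}],  # documented form for binary inputs
})
print(as_repr)
print(as_text)
print(body)
```

Decoding `as_text` with `base64.b64decode` returns the original bytes, whereas `as_repr` would first need its `b'...'` wrapper stripped.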

**Response Structure**:
```python
{
    'predictions': [[predicted_value]]
}
```

### End-to-End Workflow Visualization

**Complete Pipeline**:
1. **Client** sends HTTP POST request dengan input data
2. **TensorFlow Serving Server** (dalam Docker container) listens pada port 8501
3. **API Endpoint** receives request, routes ke appropriate signature
4. **Signature Function** parses serialized examples, transforms features
5. **Model** processes transformed features, generates predictions
6. **Response** flows back: Model → Signature → API → Server → Client

**Architecture Benefits**:
- **Scalability**: Docker containers can be replicated to handle high traffic
- **Isolation**: Each service runs independently in an isolated environment
- **Portability**: The Docker image can be deployed anywhere (local, cloud, on-premise)
- **Version Control**: Multiple model versions can be served simultaneously
- **Monitoring**: Centralized logging and metrics collection

---

## Implementation Programs

### Program 1: Complete Data Pipeline Setup

```python
import os
import tensorflow as tf
from tfx.components import CsvExampleGen, StatisticsGen, SchemaGen, Transform
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext
import absl.logging

# Setup
absl.logging.set_verbosity(absl.logging.INFO)
_pipeline_root = os.path.join(os.getcwd(), 'pipeline', 'examples', 'forest_fires_pipeline')

# Create context
context = InteractiveContext(
    pipeline_name="forest_fires",
    pipeline_root=_pipeline_root
)

# 1. Load data from CSV
example_gen = CsvExampleGen(input_base=os.path.join('data', 'csv', 'train'))
context.run(example_gen)

# 2. Generate statistics
statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])
context.run(statistics_gen)
context.show(statistics_gen.outputs['statistics'])

# 3. Infer schema
schema_gen = SchemaGen(
    statistics=statistics_gen.outputs['statistics'],
    infer_feature_shape=False
)
context.run(schema_gen)
context.show(schema_gen.outputs['schema'])

# 4. Transform features
transform = Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    module_file=os.path.abspath('forest_fires_transform.py')
)
context.run(transform)
```

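**Explanation**: The program builds the complete data pipeline from raw CSV to transformed features. CsvExampleGen ingests the CSV files and splits them into train and eval sets, StatisticsGen computes per-feature statistics (rendered by `context.show`), SchemaGen infers the schema from those statistics, and Transform converts the raw columns into features using `preprocessing_fn`. Each component's output feeds the next component.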

---

### Program 2: Feature Transformation Module

```python
# File: forest_fires_transform.py
import tensorflow as tf
import tensorflow_transform as tft
import forest_fires_constants

_DENSE_FLOAT_FEATURE_KEYS = forest_fires_constants.DENSE_FLOAT_FEATURE_KEYS
_VOCAB_FEATURE_KEYS = forest_fires_constants.VOCAB_FEATURE_KEYS
_BUCKET_FEATURE_KEYS = forest_fires_constants.BUCKET_FEATURE_KEYS
_BUCKET_FEATURE_BOUNDARIES = forest_fires_constants.BUCKET_FEATURE_BOUNDARIES
_LABEL_KEY = forest_fires_constants.LABEL_KEY
_transformed_name = forest_fires_constants.transformed_name

def preprocessing_fn(inputs):
    """Convert raw data columns to model-ready features"""
    outputs = {}
    
    # Dense float features: Z-score normalization
    for key in _DENSE_FLOAT_FEATURE_KEYS:
        outputs[_transformed_name(key)] = tft.scale_to_z_score(
            sparse_to_dense(inputs[key])
        )
    
    # Vocabulary-based categorical: Integer ID mapping
    for key in _VOCAB_FEATURE_KEYS:
        outputs[_transformed_name(key)] = tft.compute_and_apply_vocabulary(
            sparse_to_dense(inputs[key]),
            num_oov_buckets=1  # Handle unseen values
        )
    
    # Bucketized features: Assign to bins
    for key, boundary in zip(_BUCKET_FEATURE_KEYS, _BUCKET_FEATURE_BOUNDARIES):
        outputs[_transformed_name(key)] = tft.apply_buckets(
            sparse_to_dense(inputs[key]),
            bucket_boundaries=[boundary]
        )
    
    # Label: Keep as numerical
    outputs[_transformed_name(_LABEL_KEY)] = sparse_to_dense(inputs[_LABEL_KEY])
    
    return outputs

def sparse_to_dense(x):
    """Convert a SparseTensor to a dense tensor of shape [batch_size]"""
    return tf.squeeze(
        tf.sparse.to_dense(
            tf.SparseTensor(x.indices, x.values, [x.dense_shape[0], 1])
        ),
        axis=1
    )
```

**Explanation**: The module defines `preprocessing_fn`, which the TFX Transform component requires. It applies three kinds of transformation: Z-score normalization for continuous features, vocabulary mapping for categorical strings, and bucketization to discretize continuous values. The label is passed through unchanged. The `sparse_to_dense` utility converts sparse tensors to dense ones.
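
To make the three transformations concrete, here is a plain-Python sketch of what each one computes on a small list of values. It is not the `tensorflow_transform` implementation. The helper names are hypothetical, and the use of population variance, the tie-breaking order in the vocabulary, and the left-closed bucket intervals are assumptions of this sketch.

```python
import bisect
import math
from collections import Counter


def z_score(values):
    """Scale values to zero mean and unit variance."""
    mean = sum(values) / len(values)
    # Population variance (an assumption of this sketch)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def build_vocab(values):
    """Map each distinct token to an integer ID, most frequent first."""
    counts = Counter(values)
    # Ties are broken alphabetically here purely for determinism
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate(ordered)}


def apply_vocab(value, vocab):
    """Unseen tokens fall into a single out-of-vocabulary bucket."""
    return vocab.get(value, len(vocab))


def bucketize(value, boundaries):
    """Index of the bin containing value; boundaries must be sorted."""
    return bisect.bisect_right(boundaries, value)


vocab = build_vocab(["aug", "sep", "aug", "mar"])
print(z_score([10.0, 20.0, 30.0]))
print(vocab, apply_vocab("oct", vocab))
print(bucketize(45.0, [30.0, 60.0]))
```

In the real pipeline, the mean, variance, and vocabulary are computed over the full training set by Transform and then baked into the transform graph, which is what keeps preprocessing identical at training and serving time.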

---

### Program 3: Model Definition with Feature Columns

```python
# File: forest_fires_trainer.py (part 1)
import tensorflow as tf
import forest_fires_constants

_DENSE_FLOAT_FEATURE_KEYS = forest_fires_constants.DENSE_FLOAT_FEATURE_KEYS
_VOCAB_FEATURE_KEYS = forest_fires_constants.VOCAB_FEATURE_KEYS
_BUCKET_FEATURE_KEYS = forest_fires_constants.BUCKET_FEATURE_KEYS
_BUCKET_FEATURE_BOUNDARIES = forest_fires_constants.BUCKET_FEATURE_BOUNDARIES
_MAX_CATEGORICAL_FEATURE_VALUES = forest_fires_constants.MAX_CATEGORICAL_FEATURE_VALUES
_transformed_names = lambda keys: [key + '_xf' for key in keys]

def _build_keras_model():
    """Build a regression neural network with feature columns"""
    
    # Dense float feature columns
    real_valued_columns = [
        tf.feature_column.numeric_column(key=key, shape=(1,))
        for key in _transformed_names(_DENSE_FLOAT_FEATURE_KEYS)
    ]
    
    # Bucketized feature columns (one-hot encoded)
    categorical_columns = [
        tf.feature_column.indicator_column(
            tf.feature_column.categorical_column_with_identity(
                key,
                num_buckets=len(boundaries)+1
            )
        )
        for key, boundaries in zip(
            _transformed_names(_BUCKET_FEATURE_KEYS),
            _BUCKET_FEATURE_BOUNDARIES
        )
    ]
    
    # Vocab-based categorical columns (one-hot encoded)
    categorical_columns += [
        tf.feature_column.indicator_column(
            tf.feature_column.categorical_column_with_identity(
                key,
                num_buckets=num_buckets,
                default_value=num_buckets-1  # OOV handling
            )
        )
        for key, num_buckets in zip(
            _transformed_names(_VOCAB_FEATURE_KEYS),
            _MAX_CATEGORICAL_FEATURE_VALUES
        )
    ]
    
    # Build DNN regressor
    model = _dnn_regressor(
        columns=real_valued_columns + categorical_columns,
        dnn_hidden_units=[128, 64]
    )
    
    return model

def _dnn_regressor(columns, dnn_hidden_units):
    """Define deep neural network architecture"""
    
    # Input layers for each feature type
    input_layers = {
        colname: tf.keras.layers.Input(name=colname, shape=(), dtype=tf.float32)
        for colname in _transformed_names(_DENSE_FLOAT_FEATURE_KEYS)
    }
    input_layers.update({
        colname: tf.keras.layers.Input(name=colname, shape=(), dtype='int32')
        for colname in _transformed_names(_VOCAB_FEATURE_KEYS)
    })
    input_layers.update({
        colname: tf.keras.layers.Input(name=colname, shape=(), dtype='int32')
        for colname in _transformed_names(_BUCKET_FEATURE_KEYS)
    })
    
    # DenseFeatures layer aggregate all inputs
    output = tf.keras.layers.DenseFeatures(columns)(input_layers)
    
    # Hidden layers with tanh activation
    for numnodes in dnn_hidden_units:
        output = tf.keras.layers.Dense(numnodes, activation='tanh')(output)
    
    # Regression output (linear activation)
    output = tf.keras.layers.Dense(1)(output)
    
    # Compile model
    model = tf.keras.Model(input_layers, output)
    model.compile(
        loss='mean_squared_error',
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)
    )
    
    return model
```

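**Explanation**: The program defines a Keras model using the feature-columns approach. `real_valued_columns` covers the continuous features, and `categorical_columns` covers the bucketized and vocabulary-based features with one-hot encoding. The DenseFeatures layer combines all Input layers into a single tensor. Architecture: Input → DenseFeatures → Dense(128, tanh) → Dense(64, tanh) → Dense(1). The loss is MSE, which suits regression. Because `tf.feature_column` is deprecated in recent TensorFlow releases, the listing assumes a TensorFlow 2.x version that still ships it.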
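
Conceptually, the DenseFeatures step turns one example into a single flat vector: numeric values are copied as-is and each categorical ID becomes a one-hot segment. The plain-Python sketch below illustrates that idea only. The helper names are hypothetical, and it keeps the features in the order given, whereas the real layer decides the column order itself.

```python
def one_hot(index, depth):
    """Return a list of length depth with a single 1.0 at index."""
    if not 0 <= index < depth:
        raise ValueError(f"index {index} outside [0, {depth})")
    return [1.0 if i == index else 0.0 for i in range(depth)]


def dense_features_row(numeric, categorical):
    """Concatenate numeric values and one-hot encoded categorical IDs.

    numeric: list of floats
    categorical: list of (index, depth) pairs
    """
    row = list(numeric)
    for index, depth in categorical:
        row.extend(one_hot(index, depth))
    return row


# Two numeric features, one 3-bucket feature, one 4-value vocabulary feature
row = dense_features_row([0.5, -1.2], [(2, 3), (0, 4)])
print(row, len(row))
```

The width of this vector (here 2 + 3 + 4 = 9) is the input dimension seen by the first Dense layer.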

---

### Program 4: Training Function with Signatures

```python
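# File: forest_fires_trainer.py (part 2)
# tensorflow (tf) and forest_fires_constants are already imported in part 1
import os
from typing import List, Text

import tensorflow_transform as tft
import tfx.components
from tfx_bsl.public import tfxio

# Same helpers as used in forest_fires_transform.py
_LABEL_KEY = forest_fires_constants.LABEL_KEY
_transformed_name = forest_fires_constants.transformed_name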

def _input_fn(file_pattern: List[Text],
              data_accessor: tfx.components.DataAccessor,
              tf_transform_output: tft.TFTransformOutput,
              batch_size: int = 200):
    """Generate a tf.data.Dataset from TFRecord files"""
    return data_accessor.tf_dataset_factory(
        file_pattern,
        tfxio.TensorFlowDatasetOptions(
            batch_size=batch_size,
            label_key=_transformed_name(_LABEL_KEY)
        ),
        tf_transform_output.transformed_metadata.schema
    )

def _get_serve_tf_examples_fn(model, tf_transform_output):
    """Create the signature function used for serving"""
    model.tft_layer = tf_transform_output.transform_features_layer()
    
    @tf.function
    def serve_tf_examples_fn(serialized_tf_examples):
        # Get feature specs and remove the label
        feature_spec = tf_transform_output.raw_feature_spec()
        feature_spec.pop(_LABEL_KEY)
        
        # Parse serialized examples
        parsed_features = tf.io.parse_example(
            serialized_tf_examples,
            feature_spec
        )
        
        # Transform features
        transformed_features = model.tft_layer(parsed_features)
        
        # Return predictions
        return model(transformed_features)
    
    return serve_tf_examples_fn

def run_fn(fn_args: tfx.components.FnArgs):
    """Main training function called by TFX Trainer"""
    
    # Load transform graph
    tf_transform_output = tft.TFTransformOutput(fn_args.transform_graph_path)
    
    # Create datasets
    train_dataset = _input_fn(
        fn_args.train_files,
        fn_args.data_accessor,
        tf_transform_output,
        batch_size=40
    )
    eval_dataset = _input_fn(
        fn_args.eval_files,
        fn_args.data_accessor,
        tf_transform_output,
        batch_size=40
    )
    
    # Build model
    model = _build_keras_model()
    
    # Setup CSV logger
    csv_write_dir = os.path.join(fn_args.model_run_dir, 'model_performance')
    os.makedirs(csv_write_dir, exist_ok=True)
    csv_callback = tf.keras.callbacks.CSVLogger(
        os.path.join(csv_write_dir, 'performance.csv'),
        append=False
    )
    
    # Train model
    model.fit(
        train_dataset,
        steps_per_epoch=fn_args.train_steps,
        validation_data=eval_dataset,
        validation_steps=fn_args.eval_steps,
        epochs=10,
        callbacks=[csv_callback]
    )
    
    # Define serving signatures
    signatures = {
        'serving_default': _get_serve_tf_examples_fn(
            model, tf_transform_output
        ).get_concrete_function(
            tf.TensorSpec(shape=[None], dtype=tf.string, name='examples')
        )
    }
    
    # Save model with signatures
    model.save(
        fn_args.serving_model_dir,
        save_format='tf',
        signatures=signatures
    )
```

**Explanation**: `run_fn` orchestrates the entire training process. `_input_fn` creates a `tf.data.Dataset` from the TFRecord files, and `_get_serve_tf_examples_fn` defines the signature that handles API requests. Training uses CSVLogger to track performance per epoch. Signatures are critical for TensorFlow Serving because they define how the model processes incoming HTTP requests. `get_concrete_function` traces the function for the given input spec, without running it on real data, to create the serving signature.
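
The `performance.csv` file written by CSVLogger has one row per epoch, with an `epoch` column followed by the logged metrics. As a minimal sketch, the helper below (a hypothetical name, using only the standard library) picks the epoch with the lowest validation loss. The sample numbers are invented for illustration and are not results from this pipeline.

```python
import csv
import io


def best_epoch(csv_text, metric="val_loss"):
    """Return (epoch, value) for the row with the lowest metric."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    best = min(rows, key=lambda r: float(r[metric]))
    return int(best["epoch"]), float(best[metric])


# Invented sample in the CSVLogger layout (epochs are zero-based)
sample = "epoch,loss,val_loss\n0,5.1,4.9\n1,4.2,4.4\n2,3.9,4.6\n"
print(best_epoch(sample))
```

To use it on the real log, read the file contents first, for example with `open(path).read()`.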

---

### Program 5: Deploying and Testing the Model via API

```python
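# Dockerfile for TensorFlow Serving (kept in a string so the listing stays valid Python)
"""
FROM tensorflow/serving:2.6.0

ENV MODEL_BASE_PATH=/models
ENV MODEL_NAME=forest_fires_model

# The source path must lie inside the Docker build context
COPY <pipeline_root>/Trainer/model/<execution_id>/Format-Serving \
     ${MODEL_BASE_PATH}/${MODEL_NAME}/1

EXPOSE 8501

# No CMD is needed: the base image's entrypoint already starts
# tensorflow_model_server with --rest_api_port=8501 and reads
# MODEL_NAME and MODEL_BASE_PATH from the environment.
# (An exec-form CMD would not expand ${MODEL_NAME} in any case.)
"""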

# Build Docker image
# Terminal: docker build -t forest-fires-model-server:latest .

# Run Docker container
# Terminal: docker run -p 8501:8501 --name forest-fires-server \
#                      forest-fires-model-server:latest

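# Python client to send requests
import base64
import json
import requests

# Prepare input data
input_data = {
    "X": 7, "Y": 4,
    "month": "oct", "day": "fri",
    "FFMC": 60, "DMC": 30, "DC": 200, "ISI": 9,
    "temp": 30, "RH": 50, "wind": 10, "rain": 0
}

# Base64-encode the payload. b64encode returns bytes, so decode it to str;
# wrapping it in str() would embed a literal "b'...'" in the request.
# Note: a signature built on tf.io.parse_example expects serialized
# tf.train.Example bytes rather than this JSON string.
encoded = base64.b64encode(
    json.dumps(input_data).encode('utf-8')
).decode('utf-8')

# Create request body; binary values are wrapped as {"b64": ...}
req_body = {
    "signature_name": "serving_default",
    "instances": [{"b64": encoded}]
}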

# Send POST request
response = requests.post(
    'http://localhost:8501/v1/models/forest_fires_model:predict',
    data=json.dumps(req_body),
    headers={"content-type": "application/json"}
)

# Parse predictions
if response.status_code == 200:
    predictions = json.loads(response.text)
    print(f"Predicted burnt area: {predictions['predictions'][0][0]:.2f} hectares")
else:
    print(f"Error: {response.status_code}")
    print(response.text)

# Get model metadata
metadata_response = requests.get(
    'http://localhost:8501/v1/models/forest_fires_model/metadata'
)
if metadata_response.status_code == 200:
    metadata = json.loads(metadata_response.text)
    print("Available signatures:", metadata['metadata']['signature_def'].keys())
```

**Explanation**: This is the complete deployment workflow. The Dockerfile defines a container based on TensorFlow Serving, with the model copied from the pipeline output. `docker build` creates the image, and `docker run` launches the container and publishes port 8501. The Python client builds a request with Base64-encoded input and sends a POST request to the predict endpoint; the response contains the predictions. The metadata endpoint reports the available signatures. This architecture enables scalable, production-ready model serving.
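
Before sending predict requests, a client can check that the model has finished loading. TensorFlow Serving exposes a model status endpoint at `GET /v1/models/<model_name>`, which returns a `model_version_status` list. The sketch below parses such a response; the helper name `available_versions` and the sample body are illustrative assumptions.

```python
def available_versions(status_body):
    """Return the version strings whose state is AVAILABLE."""
    return [
        entry["version"]
        for entry in status_body.get("model_version_status", [])
        if entry.get("state") == "AVAILABLE"
    ]


# Illustrative status response for a server with one loaded version
sample_status = {
    "model_version_status": [
        {
            "version": "1",
            "state": "AVAILABLE",
            "status": {"error_code": "OK", "error_message": ""},
        }
    ]
}
print(available_versions(sample_status))
```

With `requests`, the body would come from `requests.get('http://localhost:8501/v1/models/forest_fires_model').json()`; an empty result means the model is not ready to serve yet.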

---

## Conclusion

TFX provides a robust framework for implementing end-to-end MLOps workflows and reduces the time needed to productionize machine learning models. Its key advantages are: automated data validation with StatisticsGen and SchemaGen, which keeps bad data from corrupting models; feature transformations with tensorflow_transform, which ensure consistent preprocessing between training and serving; the TFX Trainer API, which standardizes model training while supporting custom training loops; SignatureDefs, which allow flexible serving behavior for different use cases; Docker containerization, which ensures portability and isolation; and TensorFlow Serving, which provides production-grade model serving through a REST API. The complete pipeline, from raw CSV to a deployed model reachable over HTTP, shows how TFX streamlines ML workflows in production environments. For companies running dozens or hundreds of models, MLOps with TFX reduces operational overhead, improves model quality through systematic validation, and shortens deployment cycles. Future directions include advanced validation checks, A/B testing frameworks, model monitoring dashboards, automated retraining triggers, and multi-model serving optimization.