# 📙 FILE 3-B: CLEAN ML PIPELINE & MODEL EVALUATION

**Part:** ADVANCED & PROFESSIONAL (Production-Ready)

**Objectives:**
- ✅ Build a clean ML pipeline
- ✅ Ensure reproducibility
- ✅ Evaluate models professionally
- ✅ Choose metrics for different problem types
- ✅ Apply best practices for production

**Duration:** 2-3 weeks

---

## 📚 Table of Contents

### PART 1: CLEAN ML PIPELINE
1. What is an ML Pipeline?
2. Pipeline Components
3. Data Pipeline
4. Training Pipeline
5. Config Management
6. Experiment Tracking

### PART 2: REPRODUCIBILITY
1. What is Reproducibility?
2. Random Seeds
3. Version Control
4. Environment Management
5. Data Versioning

### PART 3: MODEL EVALUATION & METRICS
1. Classification Metrics
2. Regression Metrics
3. Confusion Matrix Analysis
4. ROC & AUC
5. Cross-validation
6. Model Comparison

---

In [None]:
# Import the required libraries
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import json
import yaml
import pickle
from datetime import datetime
import random
import os

# Sklearn metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report,
    roc_curve, auc, roc_auc_score,
    mean_squared_error, mean_absolute_error, r2_score
)
from sklearn.model_selection import KFold, StratifiedKFold

print(f"✅ TensorFlow version: {tf.__version__}")
print(f"✅ NumPy version: {np.__version__}")

---

# PART 1: CLEAN ML PIPELINE

## 1.1 What is an ML Pipeline?

### Definition

**ML Pipeline** = an automated process that chains together the steps of an ML workflow

### Why do we need an ML Pipeline?

| Problem | How a pipeline helps |
|---------|----------------------|
| 🔄 **Reproducibility** | Fixed process that is easy to repeat |
| 🐛 **Debugging** | Easy to locate bugs and test each component |
| 📊 **Scaling** | Easy to scale up to production |
| 👥 **Collaboration** | Clear code that is easy to work on as a team |
| 🔧 **Maintenance** | Easy to update and maintain |

### Components of an ML Pipeline

```
Data Loading → Preprocessing → Augmentation → Training → Evaluation → Deployment
```
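
The chain above can be sketched as a list of stage functions where each stage consumes the output of the previous one. This is only an illustration: `load_data`, `preprocess`, `evaluate`, and `run_pipeline` are hypothetical names, and the stages are toy stand-ins rather than real data loading or training code.

```python
from functools import reduce


def load_data(_):
    # Hypothetical stand-in for reading raw samples from disk
    return [3, 1, 2]


def preprocess(samples):
    # Hypothetical stand-in: scale every sample by the largest value
    top = max(samples)
    return [s / top for s in samples]


def evaluate(samples):
    # Hypothetical stand-in: report a single summary number
    return {"mean": sum(samples) / len(samples)}


def run_pipeline(stages, initial=None):
    """Feed the output of each stage into the next one, in order."""
    return reduce(lambda data, stage: stage(data), stages, initial)


result = run_pipeline([load_data, preprocess, evaluate])
print(result)
```

Because each stage is an ordinary function, it can be unit-tested on its own and swapped without touching the rest of the chain.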

### Good Pipeline vs Bad Pipeline

#### ❌ BAD Pipeline
```python
# Messy notebook, code scattered everywhere
# Magic numbers everywhere
# No config
# No experiment tracking
# Not reproducible
```

#### ✅ GOOD Pipeline
```python
# Modular, organized
# Config-driven
# Logging & tracking
# Version control
# Reproducible
# Documented
```

## 1.2 Config Management

### Why use a Config?

- ✅ Separates code from parameters
- ✅ Makes hyperparameters easy to change
- ✅ Makes experiments easy to share
- ✅ Puts configs under version control

### Config File Format

Options:
- **JSON**: Simple and easy to read
- **YAML**: More readable than JSON, supports comments
- **Python dict**: The most flexible

Recommendation: **YAML**, for its readability and comment support

In [None]:
# Example: config as a Python class

class Config:
    """Configuration for ML pipeline"""
    
    # Data
    DATA_DIR = './data'
    IMG_SIZE = 224
    BATCH_SIZE = 32
    
    # Model
    MODEL_NAME = 'MobileNetV2'
    NUM_CLASSES = 2
    DROPOUT_RATE = 0.2
    
    # Training
    EPOCHS = 50
    LEARNING_RATE = 0.001
    OPTIMIZER = 'adam'
    LOSS = 'binary_crossentropy'
    
    # Callbacks
    EARLY_STOPPING_PATIENCE = 10
    REDUCE_LR_PATIENCE = 5
    REDUCE_LR_FACTOR = 0.5
    
    # Paths
    MODEL_DIR = './models'
    LOG_DIR = './logs'
    CHECKPOINT_DIR = './checkpoints'
    
    # Random seed
    SEED = 42
    
    @classmethod
    def to_dict(cls):
        """Convert config to dictionary"""
        # Keep only the upper-case settings. Filtering with callable() is not
        # enough: classmethod objects are not callable, so to_dict/save/load
        # would end up in the dict and break json.dump
        return {k: v for k, v in vars(cls).items() if k.isupper()}
    
    @classmethod
    def save(cls, path):
        """Save config to JSON file"""
        with open(path, 'w') as f:
            json.dump(cls.to_dict(), f, indent=2)
        print(f"✅ Config saved to {path}")
    
    @classmethod
    def load(cls, path):
        """Load config from JSON file"""
        with open(path, 'r') as f:
            config_dict = json.load(f)
        
        for key, value in config_dict.items():
            setattr(cls, key, value)
        
        print(f"✅ Config loaded from {path}")

# Test
config = Config()
print("📋 Current Config:")
print(json.dumps(config.to_dict(), indent=2))

In [None]:
# Example: config in YAML (recommended)

config_yaml = """
# ML Pipeline Configuration

# Data Configuration
data:
  data_dir: ./data
  img_size: 224
  batch_size: 32
  validation_split: 0.2
  
# Model Configuration  
model:
  name: MobileNetV2
  num_classes: 2
  dropout_rate: 0.2
  pretrained: true
  
# Training Configuration
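training:
  epochs: 50
  learning_rate: 0.001
  optimizer: adam
  # Integer labels with a softmax output of num_classes units
  loss: sparse_categorical_crossentropy
  # Precision and recall are computed after training with scikit-learn (Part 3)
  metrics:
    - accuracy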
  
# Callbacks
callbacks:
  early_stopping:
    patience: 10
    restore_best_weights: true
  reduce_lr:
    patience: 5
    factor: 0.5
    min_lr: 1.0e-7
  model_checkpoint:
    save_best_only: true
    
# Paths
paths:
  model_dir: ./models
  log_dir: ./logs
  checkpoint_dir: ./checkpoints
  
# Reproducibility
seed: 42
"""

# Save config
with open('config.yaml', 'w') as f:
    f.write(config_yaml)

# Load config
with open('config.yaml', 'r') as f:
    config = yaml.safe_load(f)

print("✅ Config loaded from YAML:")
print(json.dumps(config, indent=2))

print("\n💡 Advantages of YAML:")
print("   ✅ More readable than JSON")
print("   ✅ Supports comments")
print("   ✅ Hierarchical structure")
print("   ✅ Easy to edit in a text editor")
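
Since the YAML file loads into nested dictionaries, one way to describe an experiment is as a small set of overrides applied on top of a base config. The sketch below assumes that approach; `merge_config` is a hypothetical helper, not part of PyYAML or of the pipeline classes in this notebook.

```python
import copy


def merge_config(base, overrides):
    """Return a copy of `base` with `overrides` applied recursively."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are sections: merge them key by key
            merged[key] = merge_config(merged[key], value)
        else:
            # Scalars, lists, and new keys simply replace the old value
            merged[key] = copy.deepcopy(value)
    return merged


base_config = {
    "training": {"epochs": 50, "learning_rate": 0.001},
    "seed": 42,
}
experiment = merge_config(base_config, {"training": {"learning_rate": 0.0001}})
print(experiment["training"])
```

The base config is left untouched, so several experiments can be derived from the same file and each resulting dictionary can be logged as-is.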

## 1.3 Data Pipeline

### Clean Data Pipeline Structure

```python
class DataPipeline:
    def __init__(self, config): ...
    def load_data(self): ...
    def preprocess(self): ...
    def augment(self): ...
    def create_dataset(self): ...
```

In [None]:
class DataPipeline:
    """Clean data pipeline for image classification"""
    
    def __init__(self, config):
        """
        Args:
            config: Configuration dictionary
        """
        self.config = config
        self.img_size = config['data']['img_size']
        self.batch_size = config['data']['batch_size']
        self.data_dir = config['data']['data_dir']
        
    def create_preprocessing_fn(self):
        """Create preprocessing function"""
        def preprocess(image, label):
            # Resize
            image = tf.image.resize(image, (self.img_size, self.img_size))
            # Normalize to [0, 1]
            image = image / 255.0
            return image, label
        return preprocess
    
    def create_augmentation_fn(self):
        """Create augmentation function for training"""
        def augment(image, label):
            # Random flip
            image = tf.image.random_flip_left_right(image)
            # Random brightness
            image = tf.image.random_brightness(image, max_delta=0.2)
            # Random contrast
            image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
            # Clip values
            image = tf.clip_by_value(image, 0.0, 1.0)
            return image, label
        return augment
    
    def create_dataset(self, file_pattern, is_training=True):
        """
        Create tf.data.Dataset
        
        Args:
            file_pattern: Pattern for files (e.g., 'train/*.jpg');
                currently unused, images are read from config data_dir
            is_training: Whether this is training data
        
        Returns:
            tf.data.Dataset
        """
        # Load dataset (example using image_dataset_from_directory)
        dataset = keras.utils.image_dataset_from_directory(
            self.data_dir,
            image_size=(self.img_size, self.img_size),
            batch_size=self.batch_size,
            shuffle=is_training
        )
        
        # Preprocessing
        preprocess_fn = self.create_preprocessing_fn()
        dataset = dataset.map(preprocess_fn, num_parallel_calls=tf.data.AUTOTUNE)
        
        # Cache only the deterministic part. Caching after augmentation
        # would freeze the random transforms after the first epoch
        dataset = dataset.cache()
        
        # Augmentation (training only)
        if is_training:
            augment_fn = self.create_augmentation_fn()
            dataset = dataset.map(augment_fn, num_parallel_calls=tf.data.AUTOTUNE)
        
        # Performance optimization
        dataset = dataset.prefetch(tf.data.AUTOTUNE)
        
        return dataset
    
    def get_train_val_datasets(self, validation_split=0.2):
        """
        Get training and validation datasets
        
        Args:
            validation_split: Fraction of data for validation
        
        Returns:
            train_dataset, val_dataset
        """
        # Implementation depends on data structure
        pass

print("✅ DataPipeline class defined!")
print("\n📚 Features:")
print("   ✅ Config-driven")
print("   ✅ Modular (preprocess, augment separate)")
print("   ✅ Performance optimized (cache, prefetch)")
print("   ✅ Easy to test and maintain")
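
`get_train_val_datasets` is left unimplemented above because it depends on how the data is stored. As one possible starting point, assuming the data is available as a flat list of file paths, a seeded split could look like the sketch below. `split_train_val` is a hypothetical helper. Keras' `image_dataset_from_directory` also accepts `validation_split`, `subset`, and `seed` arguments, which may be the simpler route for directory-based data.

```python
import random


def split_train_val(file_paths, validation_split=0.2, seed=42):
    """Split file paths into (train, val) lists, reproducibly."""
    paths = sorted(file_paths)          # fixed starting order
    random.Random(seed).shuffle(paths)  # local RNG, global state untouched
    n_val = int(len(paths) * validation_split)
    return paths[n_val:], paths[:n_val]


files = [f"img_{i:03d}.jpg" for i in range(10)]
train_files, val_files = split_train_val(files, validation_split=0.2, seed=42)
print(len(train_files), len(val_files))
```

Sorting before shuffling makes the split independent of the order in which the file system happens to list the files.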

## 1.4 Training Pipeline

### Clean Training Pipeline Structure

In [None]:
class TrainingPipeline:
    """Clean training pipeline"""
    
    def __init__(self, config):
        self.config = config
        self.model = None
        self.history = None
        
        # Create directories
        self._create_directories()
        
        # Set seeds for reproducibility
        self._set_seeds(config['seed'])
    
    def _create_directories(self):
        """Create necessary directories"""
        paths = self.config['paths']
        for path in paths.values():
            Path(path).mkdir(parents=True, exist_ok=True)
        print("✅ Directories created")
    
    def _set_seeds(self, seed):
        """Set random seeds for reproducibility"""
        np.random.seed(seed)
        tf.random.set_seed(seed)
        random.seed(seed)
        # Only affects Python processes started after this point; export
        # PYTHONHASHSEED before launching Python to cover this process
        os.environ['PYTHONHASHSEED'] = str(seed)
        print(f"✅ Seeds set to {seed}")
    
    def build_model(self):
        """Build model from config"""
        model_config = self.config['model']
        # Take the input size from the config instead of hard-coding 224
        img_size = self.config['data']['img_size']
        input_shape = (img_size, img_size, 3)
        
        # Base model
        if model_config['name'] == 'MobileNetV2':
            from tensorflow.keras.applications import MobileNetV2
            base_model = MobileNetV2(
                input_shape=input_shape,
                include_top=False,
                weights='imagenet' if model_config['pretrained'] else None
            )
            base_model.trainable = False
        else:
            raise ValueError(f"Unsupported model: {model_config['name']}")
        
        # Build complete model
        inputs = keras.Input(shape=input_shape)
        # DataPipeline scales pixels to [0, 1], while MobileNetV2's ImageNet
        # weights expect [-1, 1], so rescale before the base model
        x = layers.Rescaling(scale=2.0, offset=-1.0)(inputs)
        x = base_model(x, training=False)
        x = layers.GlobalAveragePooling2D()(x)
        x = layers.Dropout(model_config['dropout_rate'])(x)
        outputs = layers.Dense(model_config['num_classes'], activation='softmax')(x)
        
        self.model = keras.Model(inputs, outputs)
        print("✅ Model built")
        return self.model
    
    def compile_model(self):
        """Compile model from config"""
        train_config = self.config['training']
        
        # Optimizer
        if train_config['optimizer'] == 'adam':
            optimizer = keras.optimizers.Adam(
                learning_rate=train_config['learning_rate']
            )
        else:
            raise ValueError(f"Unsupported optimizer: {train_config['optimizer']}")
        
        # Compile
        self.model.compile(
            optimizer=optimizer,
            loss=train_config['loss'],
            metrics=train_config['metrics']
        )
        print("✅ Model compiled")
    
    def create_callbacks(self):
        """Create callbacks from config"""
        cb_config = self.config['callbacks']
        paths = self.config['paths']
        
        callbacks = []
        
        # Early Stopping
        callbacks.append(keras.callbacks.EarlyStopping(
            monitor='val_loss',
            patience=cb_config['early_stopping']['patience'],
            restore_best_weights=cb_config['early_stopping']['restore_best_weights'],
            verbose=1
        ))
        
        # ReduceLROnPlateau
        callbacks.append(keras.callbacks.ReduceLROnPlateau(
            monitor='val_loss',
            patience=cb_config['reduce_lr']['patience'],
            factor=cb_config['reduce_lr']['factor'],
            min_lr=cb_config['reduce_lr']['min_lr'],
            verbose=1
        ))
        
        # ModelCheckpoint
        checkpoint_path = Path(paths['checkpoint_dir']) / 'best_model.keras'
        callbacks.append(keras.callbacks.ModelCheckpoint(
            str(checkpoint_path),
            monitor='val_accuracy',
            save_best_only=cb_config['model_checkpoint']['save_best_only'],
            verbose=1
        ))
        
        # TensorBoard
        log_dir = Path(paths['log_dir']) / datetime.now().strftime('%Y%m%d-%H%M%S')
        callbacks.append(keras.callbacks.TensorBoard(
            log_dir=str(log_dir),
            histogram_freq=1
        ))
        
        print(f"✅ Created {len(callbacks)} callbacks")
        return callbacks
    
    def train(self, train_dataset, val_dataset):
        """Train model"""
        train_config = self.config['training']
        callbacks = self.create_callbacks()
        
        print("\n🚀 Starting training...\n")
        
        self.history = self.model.fit(
            train_dataset,
            validation_data=val_dataset,
            epochs=train_config['epochs'],
            callbacks=callbacks,
            verbose=1
        )
        
        print("\n✅ Training completed!")
        return self.history
    
    def save_model(self, name='final_model'):
        """Save model"""
        model_path = Path(self.config['paths']['model_dir']) / f'{name}.keras'
        self.model.save(str(model_path))
        print(f"✅ Model saved to {model_path}")
    
    def save_history(self, name='history'):
        """Save training history"""
        history_path = Path(self.config['paths']['log_dir']) / f'{name}.json'
        # Cast to plain floats: some entries (e.g. the learning rate logged by
        # ReduceLROnPlateau) can be NumPy scalars, which json cannot serialize
        history = {k: [float(v) for v in vals]
                   for k, vals in self.history.history.items()}
        with open(history_path, 'w') as f:
            json.dump(history, f, indent=2)
        print(f"✅ History saved to {history_path}")

print("✅ TrainingPipeline class defined!")
print("\n📚 Features:")
print("   ✅ Config-driven (flexible)")
print("   ✅ Reproducible (seeds, versioning)")
print("   ✅ Organized (directories, logging)")
print("   ✅ Easy to use")
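
When a training history is written to JSON, one detail is easy to miss: depending on the Keras version, some logged values can be NumPy scalars such as `np.float32`, which the standard `json` module refuses to encode. The sketch below shows the failure and a simple conversion. `to_jsonable` and `fake_history` are hypothetical names, and the values are made up for illustration.

```python
import json

import numpy as np


def to_jsonable(history):
    """Convert every logged value to a built-in float."""
    return {name: [float(v) for v in values] for name, values in history.items()}


# Made-up history in the shape of `History.history`: metric name -> list of values
fake_history = {
    "loss": [np.float32(0.5), np.float32(0.25)],
    "lr": [np.float32(0.001), np.float32(0.0005)],
}

try:
    json.dumps(fake_history)
    raw_dump_failed = False
except TypeError:
    # np.float32 is not a subclass of float, so json rejects it
    raw_dump_failed = True

encoded = json.dumps(to_jsonable(fake_history))
print(raw_dump_failed, encoded)
```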

## 1.5 Experiment Tracking

### Why do we need Experiment Tracking?

- ✅ Track hyperparameters
- ✅ Compare experiments
- ✅ Reproduce results
- ✅ Share with the team

### Popular tools

- **TensorBoard**: Built-in TensorFlow, free
- **MLflow**: Open-source, full-featured
- **Weights & Biases**: Cloud-based, powerful
- **Neptune.ai**: Cloud-based

### Simple Experiment Tracker

In [None]:
class ExperimentTracker:
    """Simple experiment tracker"""
    
    def __init__(self, experiment_name, log_dir='./experiments'):
        self.experiment_name = experiment_name
        self.log_dir = Path(log_dir)
        self.experiment_dir = self.log_dir / experiment_name
        self.experiment_dir.mkdir(parents=True, exist_ok=True)
        
        # Experiment metadata
        self.metadata = {
            'name': experiment_name,
            'timestamp': datetime.now().isoformat(),
            'config': {},
            'metrics': {},
            'artifacts': []
        }
    
    def log_config(self, config):
        """Log configuration"""
        self.metadata['config'] = config
        self._save_metadata()
    
    def log_metric(self, name, value, step=None):
        """Log a metric"""
        if name not in self.metadata['metrics']:
            self.metadata['metrics'][name] = []
        
        self.metadata['metrics'][name].append({
            'value': value,
            'step': step,
            'timestamp': datetime.now().isoformat()
        })
        self._save_metadata()
    
    def log_metrics(self, metrics_dict, step=None):
        """Log multiple metrics"""
        for name, value in metrics_dict.items():
            self.log_metric(name, value, step)
    
    def log_artifact(self, artifact_path, artifact_type='file'):
        """Log an artifact (model, plot, etc.)"""
        self.metadata['artifacts'].append({
            'path': str(artifact_path),
            'type': artifact_type,
            'timestamp': datetime.now().isoformat()
        })
        self._save_metadata()
    
    def _save_metadata(self):
        """Save metadata to JSON"""
        metadata_path = self.experiment_dir / 'metadata.json'
        with open(metadata_path, 'w') as f:
            json.dump(self.metadata, f, indent=2, default=str)
    
    def get_summary(self):
        """Get experiment summary"""
        summary = {
            'name': self.metadata['name'],
            'timestamp': self.metadata['timestamp'],
            'config': self.metadata['config'],
            'final_metrics': {}
        }
        
        # Get final metric values
        for name, values in self.metadata['metrics'].items():
            if values:
                summary['final_metrics'][name] = values[-1]['value']
        
        return summary
    
    @staticmethod
    def compare_experiments(experiment_names, log_dir='./experiments'):
        """Compare multiple experiments"""
        log_dir = Path(log_dir)
        experiments = []
        
        for name in experiment_names:
            metadata_path = log_dir / name / 'metadata.json'
            if metadata_path.exists():
                with open(metadata_path, 'r') as f:
                    metadata = json.load(f)
                    experiments.append(metadata)
        
        return experiments

# Example usage
tracker = ExperimentTracker('exp_001')
tracker.log_config({'learning_rate': 0.001, 'batch_size': 32})
tracker.log_metrics({'train_loss': 0.5, 'val_loss': 0.6}, step=0)
tracker.log_metrics({'train_loss': 0.3, 'val_loss': 0.4}, step=1)

print("✅ ExperimentTracker demo!")
print("\nSummary:")
print(json.dumps(tracker.get_summary(), indent=2))
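
`compare_experiments` returns the raw metadata of each run, so picking a winner is left to the caller. A minimal sketch of that step follows, assuming the metadata layout written by `ExperimentTracker` (metric name mapped to a list of `{'value', 'step', 'timestamp'}` entries). `final_value` and `best_experiment` are hypothetical helpers, and the two runs are made-up data.

```python
def final_value(experiment, metric):
    """Last logged value of `metric`, or None if it was never logged."""
    history = experiment.get("metrics", {}).get(metric, [])
    return history[-1]["value"] if history else None


def best_experiment(experiments, metric="val_loss", lower_is_better=True):
    """Name of the experiment with the best final value of `metric`."""
    scored = [(final_value(e, metric), e["name"]) for e in experiments]
    scored = [(value, name) for value, name in scored if value is not None]
    if not scored:
        return None
    pick = min if lower_is_better else max
    return pick(scored)[1]


# Made-up metadata in the same shape as metadata.json
runs = [
    {"name": "exp_001", "metrics": {"val_loss": [{"value": 0.6}, {"value": 0.4}]}},
    {"name": "exp_002", "metrics": {"val_loss": [{"value": 0.5}, {"value": 0.3}]}},
    {"name": "exp_003", "metrics": {}},
]
print(best_experiment(runs, metric="val_loss"))
```

Comparing on the final value is only one choice; using the best value over all steps would suit runs that use early stopping.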

---

# PART 2: REPRODUCIBILITY

## 2.1 What is Reproducibility?

### Definition

**Reproducibility** = the ability to obtain the same results when the code is run again

### Why does it matter?

| Reason | Explanation |
|--------|-------------|
| 🔬 **Research** | Validates research results |
| 🐛 **Debugging** | Bugs are easier to find when results are consistent |
| 👥 **Collaboration** | The team can reproduce each other's results |
| 📊 **Production** | Ensures the deployed model behaves as it did in training |
| ✅ **Trust** | Increases confidence in the model |

### Factors that affect Reproducibility

1. **Random seeds** (NumPy, TensorFlow, Python)
2. **Data order** (shuffle)
3. **Hardware** (GPU, CPU)
4. **Software versions** (TensorFlow, CUDA)
5. **Environment** (dependencies)
6. **Data versioning**

## 2.2 Random Seeds

### Sources of randomness in ML

1. **Python random**
2. **NumPy random**
3. **TensorFlow random**
4. **PYTHONHASHSEED** (hash randomization for `str` and `bytes`, which affects set iteration order)

### Setting seeds correctly

In [None]:
def set_seeds(seed=42):
    """
    Set all random seeds for reproducibility
    
    Args:
        seed: Random seed value
    """
    # Python random
    random.seed(seed)
    
    # NumPy random
    np.random.seed(seed)
    
    # TensorFlow random
    tf.random.set_seed(seed)
    
    # Python hash seed: only takes effect for interpreters started after this
    # point (e.g. subprocesses). To cover this process, export PYTHONHASHSEED
    # before launching Python
    os.environ['PYTHONHASHSEED'] = str(seed)
    
    # Deterministic operations (TensorFlow 2.8+); this replaces the older
    # TF_DETERMINISTIC_OPS / TF_CUDNN_DETERMINISTIC environment variables
    # ⚠️ Warning: deterministic kernels can be slower
    tf.config.experimental.enable_op_determinism()
    
    print(f"✅ All seeds set to {seed}")
    print("⚠️  Deterministic mode enabled (may be slower)")

# Test reproducibility
set_seeds(42)

# Generate random numbers
print("\nTest reproducibility:")
print(f"Python random: {random.random()}")
print(f"NumPy random: {np.random.rand()}")
print(f"TF random: {tf.random.normal([1]).numpy()}")

# Reset and generate again - should be same!
set_seeds(42)
print("\nAfter reset (should be same):")
print(f"Python random: {random.random()}")
print(f"NumPy random: {np.random.rand()}")
print(f"TF random: {tf.random.normal([1]).numpy()}")
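
`PYTHONHASHSEED` behaves differently from the other seeds: the interpreter reads it once at startup, so it has to be in the environment before Python launches. The sketch below illustrates this by starting child interpreters with a fixed hash seed and comparing the hash of the same string. `string_hash_in_child` is a hypothetical helper written for this demonstration.

```python
import os
import subprocess
import sys


def string_hash_in_child(hash_seed):
    """Run a fresh interpreter with PYTHONHASHSEED set and return hash('...')."""
    env = dict(os.environ, PYTHONHASHSEED=str(hash_seed))
    out = subprocess.run(
        [sys.executable, "-c", "print(hash('reproducibility'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(out.stdout.strip())


first = string_hash_in_child(42)
second = string_hash_in_child(42)
print(first == second)  # same seed at startup -> same string hash
```

In practice this means exporting the variable in the shell or job script (for example `PYTHONHASHSEED=42 python train.py`) rather than relying on `os.environ` inside the training script.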

## 2.3 Environment Management

### Why do we need Environment Management?

- ✅ Ensures everyone runs the same software versions
- ✅ Avoids dependency conflicts
- ✅ Makes the setup easy to share with the team

### Tools

- **pip + requirements.txt**: Simple
- **conda**: Also manages non-Python packages
- **Docker**: Isolated environment
- **Poetry**: Modern Python packaging

### Best Practice: requirements.txt

In [None]:
# Generate requirements.txt
requirements = """
# Core ML
tensorflow==2.15.0
numpy==1.24.3
scikit-learn==1.3.0

# Visualization
matplotlib==3.7.2
seaborn==0.12.2

# Utilities
pyyaml==6.0.1
tqdm==4.66.1

# Experiment tracking
tensorboard==2.15.0
mlflow==2.8.0  # Optional
"""

with open('requirements.txt', 'w') as f:
    f.write(requirements.strip())

print("✅ requirements.txt created!")
print("\n📚 Usage:")
print("   1. Install: pip install -r requirements.txt")
print("   2. Export: pip freeze > requirements.txt")
print("   3. Version specific: tensorflow==2.15.0")
print("   4. Version range: tensorflow>=2.13.0,<3.0.0")
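
Pinned versions only help if the running environment actually matches them. As a rough sketch, the standard library's `importlib.metadata` can compare installed versions against the `==` pins. `parse_requirements` and `check_pins` are hypothetical helpers, and the parser only understands simple `name==version` lines.

```python
from importlib import metadata


def parse_requirements(text):
    """Extract {package: version} from simple 'name==version' lines."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def check_pins(pins):
    """Return {package: (pinned, installed)} for every mismatch."""
    mismatches = {}
    for package, pinned in pins.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            mismatches[package] = (pinned, installed)
    return mismatches


sample = "# Core ML\nnumpy==1.24.3\nmlflow==2.8.0  # Optional\n"
pins = parse_requirements(sample)
print(pins)
print(check_pins(pins))  # depends on the local environment
```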

In [None]:
# Log environment info
def log_environment_info(save_path='environment_info.json'):
    """
    Log environment information for reproducibility
    """
    import platform
    import sys
    
    env_info = {
        'timestamp': datetime.now().isoformat(),
        'python_version': sys.version,
        'platform': platform.platform(),
        'tensorflow_version': tf.__version__,
        'numpy_version': np.__version__,
        'gpu_available': len(tf.config.list_physical_devices('GPU')) > 0,
        'cuda_version': tf.sysconfig.get_build_info().get('cuda_version', 'N/A'),
        'cudnn_version': tf.sysconfig.get_build_info().get('cudnn_version', 'N/A')
    }
    
    # Save
    with open(save_path, 'w') as f:
        json.dump(env_info, f, indent=2)
    
    print("✅ Environment info saved!")
    print("\n📋 Environment:")
    for key, value in env_info.items():
        print(f"   {key}: {value}")
    
    return env_info

# Log
env_info = log_environment_info()

## 2.4 Reproducibility Checklist

### ✅ Before Training

- [ ] Set all random seeds
- [ ] Log environment info (Python, TensorFlow, CUDA versions)
- [ ] Save config file
- [ ] Version control code (Git)
- [ ] Document data source & version

### ✅ During Training

- [ ] Log hyperparameters
- [ ] Save checkpoints
- [ ] Log metrics to TensorBoard/MLflow
- [ ] Track data preprocessing steps

### ✅ After Training

- [ ] Save final model
- [ ] Save training history
- [ ] Document results
- [ ] Archive experiment (code + config + model)

### Template: Experiment Archive Structure

```
experiment_001/
├── code/
│   ├── train.py
│   └── model.py
├── config.yaml
├── requirements.txt
├── environment_info.json
├── data/
│   └── data_version.txt
├── models/
│   ├── best_model.keras
│   └── final_model.keras
├── logs/
│   ├── training_history.json
│   └── tensorboard/
└── README.md
```
```
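
The archive lists a `data_version.txt` file but does not say what goes in it. One option, offered here only as an assumption, is a content hash of the dataset so that any change to the files is detectable. `fingerprint_directory` is a hypothetical helper; dedicated tools such as DVC cover the same need more thoroughly.

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint_directory(data_dir):
    """SHA-256 over the relative paths and contents of all files in data_dir."""
    digest = hashlib.sha256()
    root = Path(data_dir)
    # Sort so the result does not depend on file system listing order
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()


# Demonstration on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "a.txt"
    sample.write_text("hello")
    before = fingerprint_directory(tmp)
    sample.write_text("hello!")
    after = fingerprint_directory(tmp)

print(before != after)  # any content change alters the fingerprint
```

For large datasets, reading whole files into memory is wasteful; hashing in chunks or hashing a manifest of file sizes and modification times would be a reasonable variation.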

---

# PART 3: MODEL EVALUATION & METRICS

## 3.1 Classification Metrics

### Binary Classification Metrics

#### 1. Confusion Matrix

```
                 Predicted
              Negative  Positive
Actual  Neg      TN        FP
        Pos      FN        TP
```

- **TP (True Positive)**: Predicted Positive, actually Positive ✅
- **TN (True Negative)**: Predicted Negative, actually Negative ✅
- **FP (False Positive)**: Predicted Positive, actually Negative ❌ (Type I Error)
- **FN (False Negative)**: Predicted Negative, actually Positive ❌ (Type II Error)

#### 2. Metrics derived from the Confusion Matrix

| Metric | Formula | Meaning | When to use |
|--------|---------|---------|-------------|
| **Accuracy** | (TP+TN) / Total | Fraction of correct predictions | Balanced datasets |
| **Precision** | TP / (TP+FP) | Of the predicted Positives, what fraction is correct? | When False Positives are costly |
| **Recall (Sensitivity)** | TP / (TP+FN) | Of the actual Positives, what fraction is found? | When False Negatives are costly |
| **F1-Score** | 2 × (Precision × Recall) / (Precision + Recall) | Harmonic mean of Precision and Recall | Imbalanced datasets |
| **Specificity** | TN / (TN+FP) | Of the actual Negatives, what fraction is correctly identified? | Medical diagnosis |
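
To see why accuracy alone can mislead on imbalanced data, the formulas in the table can be applied directly to a set of counts. The sketch below uses made-up counts (10 actual positives out of 100 samples); `binary_metrics` is a hypothetical helper that mirrors the table rather than any library function.

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute the table's metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / denom if denom else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Made-up imbalanced case: 10 positives, 90 negatives
m = binary_metrics(tp=8, tn=85, fp=5, fn=2)
for name, value in m.items():
    print(f"{name:<12} {value:.4f}")
```

Here accuracy is 0.93 while precision is only about 0.62, because the 5 false positives are large relative to the 8 true positives.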

In [None]:
# Example: Binary Classification Evaluation

# Suppose we have predictions and ground truth
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_pred = np.array([0, 0, 1, 1, 0, 0, 0, 1, 1, 1])  # Contains a few errors

# Confusion Matrix
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print("📊 CONFUSION MATRIX:")
print(cm)
print(f"\nBreakdown:")
print(f"  TN (True Negative):  {tn}")
print(f"  FP (False Positive): {fp}")
print(f"  FN (False Negative): {fn}")
print(f"  TP (True Positive):  {tp}")

# Calculate metrics
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
specificity = tn / (tn + fp)

print("\n📈 METRICS:")
print("=" * 50)
print(f"Accuracy:    {accuracy:.4f}  ({accuracy*100:.2f}%)")
print(f"Precision:   {precision:.4f}  ({precision*100:.2f}%)")
print(f"Recall:      {recall:.4f}  ({recall*100:.2f}%)")
print(f"F1-Score:    {f1:.4f}  ({f1*100:.2f}%)")
print(f"Specificity: {specificity:.4f}  ({specificity*100:.2f}%)")
print("=" * 50)

In [None]:
# Visualize Confusion Matrix
def plot_confusion_matrix(y_true, y_pred, class_names=None, normalize=False):
    """
    Plot a confusion matrix as an annotated heatmap
    
    Args:
        y_true: True labels
        y_pred: Predicted labels
        class_names: Class names for labels
        normalize: Normalize by row (recall)
    """
    cm = confusion_matrix(y_true, y_pred)
    
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        fmt = '.2f'
        title = 'Normalized Confusion Matrix'
    else:
        fmt = 'd'
        title = 'Confusion Matrix'
    
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt=fmt, cmap='Blues', 
                xticklabels=class_names, yticklabels=class_names,
                cbar_kws={'label': 'Count' if not normalize else 'Proportion'})
    plt.xlabel('Predicted Label', fontsize=12)
    plt.ylabel('True Label', fontsize=12)
    plt.title(title, fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

# Plot
plot_confusion_matrix(y_true, y_pred, class_names=['Negative', 'Positive'])
plot_confusion_matrix(y_true, y_pred, class_names=['Negative', 'Positive'], normalize=True)

## 3.2 ROC Curve & AUC

### What is the ROC Curve?

**ROC (Receiver Operating Characteristic)** = a plot of the trade-off between True Positive Rate and False Positive Rate across classification thresholds

- **X-axis**: False Positive Rate (FPR) = FP / (FP + TN)
- **Y-axis**: True Positive Rate (TPR) = TP / (TP + FN) = Recall

### AUC (Area Under Curve)

The bands below are common rules of thumb, not strict cut-offs:

- **AUC = 1.0**: Perfect classifier ✅
- **AUC = 0.9-1.0**: Excellent
- **AUC = 0.8-0.9**: Good
- **AUC = 0.7-0.8**: Fair
- **AUC = 0.5-0.7**: Poor
- **AUC = 0.5**: Random classifier
- **AUC < 0.5**: Worse than random
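
A useful way to read these numbers: AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes AUC that way on a tiny made-up example. `auc_by_ranking` is a hypothetical helper and is quadratic in the number of samples, so it is meant for intuition rather than production use.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


auc_value = auc_by_ranking([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_value)  # 0.75: 3 of the 4 positive/negative pairs are ordered correctly
```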

### When to use ROC & AUC?

- ✅ Binary classification
- ✅ Imbalanced datasets (with heavy imbalance, also check the precision-recall curve, since ROC AUC can look optimistic)
- ✅ When a threshold-independent summary is needed
- ✅ Comparing multiple models

In [None]:
# Example: ROC Curve

# Suppose we have probability predictions
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1])
y_pred_proba = np.array([0.1, 0.2, 0.7, 0.8, 0.3, 0.6, 0.2, 0.9, 0.85, 0.4, 0.75, 0.82, 0.15, 0.25, 0.88])

# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_pred_proba)
roc_auc = auc(fpr, tpr)

# Plot ROC curve
plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, 
         label=f'ROC curve (AUC = {roc_auc:.4f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', 
         label='Random classifier (AUC = 0.5)')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate (Recall)', fontsize=12)
plt.title('ROC Curve', fontsize=14, fontweight='bold')
plt.legend(loc='lower right', fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"📊 AUC Score: {roc_auc:.4f}")

if roc_auc >= 0.9:
    print("✅ Excellent classifier!")
elif roc_auc >= 0.8:
    print("✅ Good classifier")
elif roc_auc >= 0.7:
    print("⚠️  Fair classifier")
else:
    print("❌ Poor classifier")

## 3.3 Multi-class Classification Metrics

### Metrics for Multi-class

#### 1. Micro-average
- Sum TP, FP, FN over all classes, then compute the metric once
- Every sample counts equally, so large classes dominate the result

#### 2. Macro-average
- Compute the metric for each class, then take the plain average
- Treats all classes equally
- Use when every class should carry the same weight

#### 3. Weighted-average
- Compute the metric for each class, then average weighted by support
- Use on imbalanced datasets when the score should reflect class frequencies
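
The three averages only differ when classes are imbalanced, so a small imbalanced example makes the distinction concrete. The sketch below computes micro, macro, and weighted precision by hand on made-up labels. `precision_averages` is a hypothetical helper; scikit-learn's `precision_score` exposes the same choices through its `average` parameter.

```python
def precision_averages(y_true, y_pred):
    """Return (micro, macro, weighted) precision for multi-class labels."""
    classes = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in classes}
    fp = {c: 0 for c in classes}
    support = {c: 0 for c in classes}
    for t, p in zip(y_true, y_pred):
        support[t] += 1
        if t == p:
            tp[p] += 1
        else:
            fp[p] += 1  # predicted class p, but the true class was different

    per_class = {
        c: tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0 for c in classes
    }
    micro = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))
    macro = sum(per_class.values()) / len(classes)
    weighted = sum(per_class[c] * support[c] for c in classes) / len(y_true)
    return micro, macro, weighted


# Made-up imbalanced labels: 6 samples of class 0, 3 of class 1, 1 of class 2
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred_demo = [0, 0, 0, 0, 1, 1, 1, 1, 1, 2]
micro, macro, weighted = precision_averages(y_true_demo, y_pred_demo)
print(f"micro={micro:.4f} macro={macro:.4f} weighted={weighted:.4f}")
```

Per-class precision here is 1.0, 0.6, and 1.0, which gives micro 0.80, macro about 0.87, and weighted 0.88.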

In [None]:
# Example: Multi-class Classification

# Suppose we have 3 classes
y_true_mc = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2])
y_pred_mc = np.array([0, 1, 2, 0, 2, 2, 0, 1, 1, 0, 1, 2, 0, 1, 2])  # Contains errors

class_names = ['Class A', 'Class B', 'Class C']

# Classification report
print("📊 CLASSIFICATION REPORT:")
print("=" * 70)
print(classification_report(y_true_mc, y_pred_mc, target_names=class_names))
print("=" * 70)

# Confusion matrix
plot_confusion_matrix(y_true_mc, y_pred_mc, class_names=class_names)

# Per-class metrics
precision_per_class = precision_score(y_true_mc, y_pred_mc, average=None)
recall_per_class = recall_score(y_true_mc, y_pred_mc, average=None)
f1_per_class = f1_score(y_true_mc, y_pred_mc, average=None)

print("\n📈 PER-CLASS METRICS:")
print("=" * 70)
print(f"{'Class':<15} {'Precision':<12} {'Recall':<12} {'F1-Score':<12}")
print("=" * 70)
for i, name in enumerate(class_names):
    print(f"{name:<15} {precision_per_class[i]:>8.4f}    {recall_per_class[i]:>8.4f}    {f1_per_class[i]:>8.4f}")
print("=" * 70)

## 3.4 Regression Metrics

### Common Regression Metrics

| Metric | Formula | Meaning | When to use |
|--------|---------|---------|-------------|
| **MAE** | mean(abs(y_true - y_pred)) | Average absolute error | More robust to outliers than MSE |
| **MSE** | mean((y_true - y_pred)²) | Average squared error | Penalize large errors |
| **RMSE** | sqrt(MSE) | Root mean squared error | Same unit as target |
| **R² Score** | 1 - (SS_res / SS_tot) | Proportion of variance explained | Model goodness of fit |
| **MAPE** | mean(abs((y_true - y_pred) / y_true)) * 100 | Mean Absolute Percentage Error | When relative error matters (undefined if y_true contains 0) |
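
Working through the formulas once by hand helps when interpreting library output. The sketch below implements the table in plain Python on a four-point made-up example. `regression_metrics` is a hypothetical helper, and returning NaN for undefined cases (a zero in `y_true` for MAPE, a constant `y_true` for R²) is a choice made for this sketch.

```python
import math


def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE, R² and MAPE computed directly from their formulas."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n

    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot if ss_tot else float("nan")

    if any(t == 0 for t in y_true):
        mape = float("nan")  # percentage error is undefined for a zero target
    else:
        mape = 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n

    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse), "r2": r2, "mape": mape}


metrics = regression_metrics([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
for name, value in metrics.items():
    print(f"{name:<5} {value:.4f}")
```

Note how the single target of -0.5 contributes a 100% error to MAPE on its own: percentage errors are very sensitive to targets close to zero.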

In [None]:
# Example: Regression Evaluation

# Suppose we have predictions
y_true_reg = np.array([3.0, -0.5, 2.0, 7.0, 4.5, 2.5, 1.0, 6.0, 3.5, 4.0])
y_pred_reg = np.array([2.5, 0.0, 2.1, 7.8, 4.0, 2.2, 1.5, 5.5, 3.8, 4.2])

# Calculate metrics
mae = mean_absolute_error(y_true_reg, y_pred_reg)
mse = mean_squared_error(y_true_reg, y_pred_reg)
rmse = np.sqrt(mse)
r2 = r2_score(y_true_reg, y_pred_reg)

# MAPE (careful with zero values!)
mape = np.mean(np.abs((y_true_reg - y_pred_reg) / y_true_reg)) * 100

print("📊 REGRESSION METRICS:")
print("=" * 50)
print(f"MAE (Mean Absolute Error):       {mae:.4f}")
print(f"MSE (Mean Squared Error):        {mse:.4f}")
print(f"RMSE (Root Mean Squared Error):  {rmse:.4f}")
print(f"R² Score:                        {r2:.4f}")
print(f"MAPE (Mean Absolute % Error):    {mape:.2f}%")
print("=" * 50)

# Visualize predictions vs actual
plt.figure(figsize=(12, 5))

# Scatter plot
plt.subplot(1, 2, 1)
plt.scatter(y_true_reg, y_pred_reg, alpha=0.6, s=100, edgecolors='black')
plt.plot([y_true_reg.min(), y_true_reg.max()], 
         [y_true_reg.min(), y_true_reg.max()], 
         'r--', lw=2, label='Perfect prediction')
plt.xlabel('True Values', fontsize=12)
plt.ylabel('Predictions', fontsize=12)
plt.title('Predictions vs True Values', fontsize=13, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)

# Residuals plot
plt.subplot(1, 2, 2)
residuals = y_true_reg - y_pred_reg
plt.scatter(y_pred_reg, residuals, alpha=0.6, s=100, edgecolors='black')
plt.axhline(y=0, color='r', linestyle='--', lw=2)
plt.xlabel('Predictions', fontsize=12)
plt.ylabel('Residuals', fontsize=12)
plt.title('Residual Plot', fontsize=13, fontweight='bold')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Explanation:")
print("   - Residual plot: points should scatter randomly around 0")
print("   - A pattern in the residuals → systematic model bias")
print("   - R² closer to 1 → better fit")
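
The MAPE line in the cell above divides by `y_true`, so a single zero target produces `inf` or `nan`. One possible workaround, sketched here under the assumption that skipping zero targets is acceptable for your problem, is to compute MAPE only over non-zero targets and report how many samples were skipped. The helper name `safe_mape` is hypothetical.

```python
import numpy as np

def safe_mape(y_true, y_pred):
    """MAPE (in percent) over samples with a non-zero true value.

    Returns (mape_percent, n_skipped). Skipping zero targets is an
    assumption of this sketch, not the only valid choice.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    nonzero = y_true != 0
    n_skipped = int((~nonzero).sum())
    if not nonzero.any():
        return float("nan"), n_skipped   # nothing left to average

    pct_errors = np.abs((y_true[nonzero] - y_pred[nonzero]) / y_true[nonzero])
    return float(pct_errors.mean() * 100), n_skipped

# Hypothetical values: the first target is 0 and is skipped
mape_pct, skipped = safe_mape([0.0, 2.0, 4.0], [0.5, 1.0, 5.0])
print(f"MAPE: {mape_pct:.2f}% (skipped {skipped} zero-valued targets)")
```

Always report the skipped count alongside the score. Otherwise the metric silently describes only part of the data.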

## 3.5 Cross-Validation

### What is Cross-Validation?

**Cross-Validation** = Split the data into K folds, train on K-1 folds, test on the remaining fold, and repeat K times so that each fold serves as the test set exactly once
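
The splitting step can be sketched in plain Python. This is a minimal illustration with a hypothetical helper name, `k_fold_indices`. It does no shuffling, unlike the scikit-learn splitters used later in this section.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) index lists for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")

    # The first (n_samples % k) folds get one extra sample,
    # so fold sizes differ by at most 1.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for i, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"Fold {i}: train={train_idx} val={val_idx}")
```

Each sample appears in exactly one validation fold and in the training set of every other fold.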

### Why use Cross-Validation?

- ✅ More robust evaluation than a single train/test split
- ✅ Every sample is used for both training and validation across the folds
- ✅ Reduces the variance of the performance estimate
- ✅ Well suited to small datasets

### Types of Cross-Validation

- **K-Fold CV**: Split the data into K folds of (nearly) equal size
- **Stratified K-Fold**: Preserve the class proportions in each fold
- **Leave-One-Out**: K = N (each sample is its own fold)
- **Time Series CV**: Respect temporal order (train only on the past)
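
To see how these strategies differ, the sketch below runs four scikit-learn splitters on a tiny imbalanced toy dataset and counts the positive samples in each validation fold. The toy data and the `summary` dictionary are assumptions made for illustration. `LeaveOneOut` and `TimeSeriesSplit` come from `sklearn.model_selection` and are not imported in the notebook's first cell.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, LeaveOneOut, TimeSeriesSplit

# Toy data (hypothetical): 12 ordered samples, 8 negatives followed by 4 positives
X = np.arange(12).reshape(-1, 1)
y = np.array([0] * 8 + [1] * 4)

splitters = {
    "KFold": KFold(n_splits=4),                      # no shuffle: contiguous folds
    "StratifiedKFold": StratifiedKFold(n_splits=4),  # keeps the 2:1 class ratio per fold
    "LeaveOneOut": LeaveOneOut(),                    # one sample per validation fold
    "TimeSeriesSplit": TimeSeriesSplit(n_splits=3),  # training data always precedes validation
}

summary = {}
for name, splitter in splitters.items():
    folds = list(splitter.split(X, y))
    positives_per_val = [int(y[val_idx].sum()) for _, val_idx in folds]
    summary[name] = {"n_folds": len(folds), "positives_per_val": positives_per_val}
    print(f"{name:<16} folds={len(folds):<3} positives per val fold={positives_per_val}")

# For time series CV, every training index must come before every validation index
time_ordered = all(
    train_idx.max() < val_idx.min()
    for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(X)
)
print("TimeSeriesSplit respects order:", time_ordered)
```

On sorted labels like these, unshuffled K-Fold leaves some validation folds with no positive samples at all, while Stratified K-Fold places one in every fold.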

In [None]:
# Example: K-Fold Cross-Validation

def cross_validate_model(model_fn, X, y, n_splits=5, stratified=True):
    """
    Perform K-Fold cross-validation
    
    Args:
        model_fn: Function that returns a compiled model
        X: Features
        y: Labels
        n_splits: Number of folds
        stratified: Use stratified K-fold
    
    Returns:
        scores: Dictionary of scores for each fold
    """
    # Choose K-Fold strategy
    if stratified:
        kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    else:
        kfold = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    
    # Store scores
    scores = {
        'accuracy': [],
        'precision': [],
        'recall': [],
        'f1': []
    }
    
    print(f"🔄 Running {n_splits}-Fold Cross-Validation...\n")
    
    for fold, (train_idx, val_idx) in enumerate(kfold.split(X, y)):
        print(f"Fold {fold + 1}/{n_splits}")
        
        # Split data
        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]
        
        # Create and train model
        model = model_fn()
        model.fit(
            X_train, y_train,
            epochs=10,
            batch_size=32,
            verbose=0
        )
        
        # Predict
        y_pred = (model.predict(X_val, verbose=0) > 0.5).astype(int).flatten()
        
        # Calculate metrics
        scores['accuracy'].append(accuracy_score(y_val, y_pred))
        scores['precision'].append(precision_score(y_val, y_pred, zero_division=0))
        scores['recall'].append(recall_score(y_val, y_pred, zero_division=0))
        scores['f1'].append(f1_score(y_val, y_pred, zero_division=0))
        
        print(f"  Accuracy: {scores['accuracy'][-1]:.4f}")
    
    print("\n✅ Cross-Validation completed!")
    return scores

# Example with a simple model
def create_simple_model():
    model = keras.Sequential([
        layers.Dense(16, activation='relu', input_shape=(10,)),
        layers.Dense(8, activation='relu'),
        layers.Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

# Fake data (seeded so the example is reproducible)
rng = np.random.default_rng(42)
X_fake = rng.random((100, 10))
y_fake = (rng.random(100) > 0.5).astype(int)

# Run CV
cv_scores = cross_validate_model(create_simple_model, X_fake, y_fake, n_splits=5)

# Print summary
print("\n📊 CROSS-VALIDATION RESULTS:")
print("=" * 60)
for metric, values in cv_scores.items():
    mean_val = np.mean(values)
    std_val = np.std(values)
    print(f"{metric.capitalize():<12} {mean_val:.4f} ± {std_val:.4f}")
print("=" * 60)

## 3.6 Model Comparison Framework

In [None]:
class ModelComparator:
    """Framework for comparing multiple models"""
    
    def __init__(self):
        self.models = {}
        self.results = {}
    
    def add_model(self, name, model):
        """Add model to comparison"""
        self.models[name] = model
    
    def evaluate_all(self, X_test, y_test):
        """Evaluate all models"""
        for name, model in self.models.items():
            print(f"Evaluating {name}...")
            
            # Predict
            y_pred = model.predict(X_test, verbose=0)
            
            # Binary or multi-class
            if len(y_pred.shape) > 1 and y_pred.shape[1] > 1:
                y_pred_class = np.argmax(y_pred, axis=1)
            else:
                y_pred_class = (y_pred > 0.5).astype(int).flatten()
            
            # Calculate metrics
            self.results[name] = {
                'accuracy': accuracy_score(y_test, y_pred_class),
                'precision': precision_score(y_test, y_pred_class, average='weighted', zero_division=0),
                'recall': recall_score(y_test, y_pred_class, average='weighted', zero_division=0),
                'f1': f1_score(y_test, y_pred_class, average='weighted', zero_division=0)
            }
        
        print("✅ All models evaluated!")
    
    def print_comparison(self):
        """Print comparison table"""
        print("\n📊 MODEL COMPARISON:")
        print("=" * 80)
        print(f"{'Model':<20} {'Accuracy':<12} {'Precision':<12} {'Recall':<12} {'F1-Score':<12}")
        print("=" * 80)
        
        for name, metrics in self.results.items():
            print(f"{name:<20} {metrics['accuracy']:>8.4f}    {metrics['precision']:>8.4f}    "
                  f"{metrics['recall']:>8.4f}    {metrics['f1']:>8.4f}")
        
        print("=" * 80)
    
    def plot_comparison(self):
        """Plot comparison bar chart"""
        metrics = ['accuracy', 'precision', 'recall', 'f1']
        model_names = list(self.results.keys())
        
        x = np.arange(len(metrics))
        width = 0.8 / len(model_names)
        
        plt.figure(figsize=(12, 6))
        
        for i, name in enumerate(model_names):
            values = [self.results[name][m] for m in metrics]
            plt.bar(x + i * width, values, width, label=name, alpha=0.8)
        
        plt.xlabel('Metrics', fontsize=12)
        plt.ylabel('Score', fontsize=12)
        plt.title('Model Comparison', fontsize=14, fontweight='bold')
        plt.xticks(x + width * (len(model_names) - 1) / 2, 
                   [m.capitalize() for m in metrics])
        plt.ylim(0, 1.1)
        plt.legend()
        plt.grid(axis='y', alpha=0.3)
        plt.tight_layout()
        plt.show()
    
    def get_best_model(self, metric='f1'):
        """Get best model by metric"""
        best_name = max(self.results, key=lambda x: self.results[x][metric])
        best_score = self.results[best_name][metric]
        return best_name, best_score

print("✅ ModelComparator class defined!")
print("\nUsage:")
print("  comparator = ModelComparator()")
print("  comparator.add_model('Model A', model_a)")
print("  comparator.add_model('Model B', model_b)")
print("  comparator.evaluate_all(X_test, y_test)")
print("  comparator.print_comparison()")
print("  comparator.plot_comparison()")

---

# 🎓 FILE 3-B Summary

## ✅ What You Learned

### 1. Clean ML Pipeline
- **Config Management**: YAML, Python class
- **Data Pipeline**: Modular, reproducible
- **Training Pipeline**: Config-driven, organized
- **Experiment Tracking**: Log everything!

### 2. Reproducibility
- **Random Seeds**: Set all seeds (Python, NumPy, TensorFlow)
- **Environment**: requirements.txt, environment info
- **Version Control**: Git, data versioning
- **Checklist**: Before/During/After training

### 3. Model Evaluation
- **Classification Metrics**: Accuracy, Precision, Recall, F1
- **Confusion Matrix**: Visualize errors
- **ROC & AUC**: Threshold-independent evaluation
- **Regression Metrics**: MAE, MSE, RMSE, R²
- **Cross-Validation**: Robust evaluation
- **Model Comparison**: A framework for comparing models

## 🚀 Key Takeaways

1. **Clean code** = Easy to maintain, scale, and collaborate on
2. **Config-driven** = Flexible, reproducible
3. **Reproducibility** = A must-have for production
4. **Right metrics** = More important than high accuracy
5. **Cross-validation** = Robust evaluation for small data

## 📝 Best Practices Summary

### DO ✅
- Use config files (YAML recommended)
- Set all random seeds
- Log everything (hyperparams, metrics, environment)
- Version control code + config
- Choose metrics that fit the problem
- Use cross-validation for small data
- Document thoroughly

### DON'T ❌
- Leave magic numbers in code
- Forget to set seeds
- Look only at accuracy (especially on imbalanced data)
- Test on training data
- Skip experiment tracking
- Skip documentation

## 📝 Next Steps

The next file (**3-C**) covers:
- Inference Pipeline
- Save & Load Models
- Model Deployment
- Performance Optimization
- Production Best Practices

---

**Congratulations on completing FILE 3-B! 🎉**