# Module 03: Model Serialization

**Difficulty**: ⭐⭐ Intermediate  
**Estimated Time**: 45 minutes  
**Prerequisites**: 
- Module 01: Experiment Tracking with MLflow
- Module 02: Model Versioning and Registry
- Basic understanding of file I/O

## Learning Objectives

By the end of this notebook, you will be able to:
1. Understand different model serialization formats (pickle, joblib, ONNX)
2. Choose the appropriate serialization method for your use case
3. Save and load models using various formats
4. Compare file sizes and loading times across formats
5. Handle cross-platform and cross-language deployment scenarios
6. Implement best practices for model persistence

## 1. Why Model Serialization Matters

Model serialization is the process of converting trained models into a format that can be stored and later reconstructed.

**Without proper serialization:**
- ❌ Cannot deploy models to production
- ❌ Must retrain models every time
- ❌ Cannot share models with other systems
- ❌ Limited to a single programming language

**With proper serialization:**
- ✅ Save trained models for reuse
- ✅ Deploy across different environments
- ✅ Share models between teams and systems
- ✅ Enable cross-platform deployment
- ✅ Optimize model size and loading speed

In [None]:
# Setup: Import required libraries
import pickle
import joblib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import time
import os
from pathlib import Path
import warnings

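# Silence warnings to keep the notebook output short. Note that this also hides
# the warning scikit-learn emits when a model was saved with a different version.
warnings.filterwarnings('ignore')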
sns.set_style('whitegrid')
%matplotlib inline

# Set random seed
np.random.seed(42)

# Create directory for saved models
models_dir = Path('saved_models')
models_dir.mkdir(exist_ok=True)

print("✓ Libraries imported successfully")
print(f"✓ Models will be saved to: {models_dir.absolute()}")

## 2. Overview of Serialization Formats

### Main Serialization Methods:

| Format | Use Case | Pros | Cons |
|--------|----------|------|------|
| **Pickle** | Python-only, simple models | Built-in, easy to use | Python-specific, security risks |
| **Joblib** | Large NumPy arrays, sklearn | Efficient for numerical data, built-in compression | Python-specific, same security risks as pickle |
| **ONNX** | Cross-platform deployment | Language-agnostic, optimized | Complex setup, limited model support |
| **SavedModel** | TensorFlow/Keras | Production-ready, versioning | TensorFlow-specific |
| **JSON** | Simple models, configs | Human-readable | Not for complex models |
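
The JSON row applies to models whose learned state is just a handful of numbers. The sketch below assumes a binary logistic regression; the parameter values and the `predict_proba` helper are made up for illustration and are not part of this notebook's models. For a fitted scikit-learn `LogisticRegression`, the same numbers would come from `coef_` and `intercept_`.

```python
import json
import math

# Hypothetical learned parameters; for a fitted scikit-learn LogisticRegression
# they would come from model.coef_[0].tolist() and float(model.intercept_[0]).
params = {
    "model_type": "binary_logistic_regression",
    "coef": [0.8, -1.2, 0.3],
    "intercept": -0.1,
}

# Serialize to a human-readable string (json.dump writes to a file instead).
payload = json.dumps(params, indent=2)


def predict_proba(restored, features):
    """Probability of the positive class, computed from restored parameters."""
    z = restored["intercept"] + sum(
        w * x for w, x in zip(restored["coef"], features)
    )
    return 1.0 / (1.0 + math.exp(-z))


restored = json.loads(payload)
probability = predict_proba(restored, [1.0, 0.5, 2.0])
print(f"P(class=1) = {probability:.4f}")
```

Because the file holds plain numbers rather than Python objects, any language with a JSON parser can read it, and loading it cannot execute code. The cost is that you have to reimplement the prediction logic yourself, which is why this approach stops being practical for tree ensembles and other complex models.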

## 3. Preparing Sample Data and Models

Let's create a dataset and train several models to demonstrate different serialization approaches.

In [None]:
# Generate synthetic dataset
X, y = make_classification(
    n_samples=5000,
    n_features=30,
    n_informative=25,
    n_redundant=5,
    n_classes=2,
    random_state=42
)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")

In [None]:
# Train multiple models for comparison
models = {}

# Logistic Regression (simple, small model)
lr_model = LogisticRegression(max_iter=200, random_state=42)
lr_model.fit(X_train, y_train)
models['logistic_regression'] = lr_model

# Decision Tree (medium complexity)
dt_model = DecisionTreeClassifier(max_depth=10, random_state=42)
dt_model.fit(X_train, y_train)
models['decision_tree'] = dt_model

# Random Forest (large, complex model)
rf_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf_model.fit(X_train, y_train)
models['random_forest'] = rf_model

# Print model accuracies
print("Model Training Complete:")
print("="*60)
for name, model in models.items():
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"{name:20s}: {accuracy:.4f}")

## 4. Pickle: Python's Built-in Serialization

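**Pickle** is Python's standard serialization format. It's simple to use, but the files can only be read from Python, depend on the library versions that created them, and are unsafe to load from untrusted sources.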

### When to use Pickle:
- ✅ Simple, small models
- ✅ Python-only deployment
- ✅ Quick prototyping

### When NOT to use Pickle:
- ❌ Loading files from untrusted sources (unpickling can execute arbitrary code)
- ❌ Cross-language deployment
- ❌ Large models with NumPy arrays
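
To make the security concern concrete, here is a minimal and deliberately harmless sketch of why unpickling untrusted data is risky. The `Payload` class is hypothetical; it uses `__reduce__` to tell pickle to call `print` at load time, where an attacker would substitute something destructive.

```python
import pickle


class Payload:
    """Stand-in for a malicious object hidden inside a 'model' file."""

    def __reduce__(self):
        # pickle stores this callable and its arguments, then calls it when
        # the data is loaded. An attacker could name any importable callable.
        return (print, ("this code ran during unpickling",))


untrusted_bytes = pickle.dumps(Payload())

# Loading does not give back a Payload; it runs the stored callable instead.
result = pickle.loads(untrusted_bytes)
print(f"pickle.loads returned: {result!r}")
```

The same applies to any format built on pickle, so treat model files like executable code: only load the ones you or a trusted pipeline produced.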

In [None]:
# Save model using pickle
pickle_file = models_dir / 'logistic_regression.pkl'

# Save model
with open(pickle_file, 'wb') as f:
    pickle.dump(models['logistic_regression'], f)

print(f"✓ Model saved to: {pickle_file}")
print(f"✓ File size: {pickle_file.stat().st_size / 1024:.2f} KB")

In [None]:
# Load model using pickle
# perf_counter is a monotonic, high-resolution clock suited to short intervals
start_time = time.perf_counter()

with open(pickle_file, 'rb') as f:
    loaded_model = pickle.load(f)

load_time = time.perf_counter() - start_time

# Verify model works
y_pred = loaded_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"✓ Model loaded successfully")
print(f"✓ Load time: {load_time*1000:.2f} ms")
print(f"✓ Accuracy: {accuracy:.4f}")
print(f"\nModel type: {type(loaded_model).__name__}")

## 5. Joblib: Optimized for Scikit-learn

**Joblib** builds on pickle but handles large NumPy arrays more efficiently, which makes it the usual choice for scikit-learn models. It inherits pickle's security caveat: only load files you trust.

### When to use Joblib:
- ✅ Scikit-learn models
- ✅ Large models with NumPy arrays
- ✅ Models with many estimators (Random Forest, etc.)

### Advantages over Pickle:
- Faster for objects that contain large NumPy arrays
- Built-in compression through the `compress` parameter
- Can memory-map arrays from uncompressed files (`mmap_mode` in `joblib.load`)

In [None]:
# Save Random Forest using joblib (better for large models)
joblib_file = models_dir / 'random_forest.joblib'

# Save with compression
joblib.dump(models['random_forest'], joblib_file, compress=3)

print(f"✓ Model saved to: {joblib_file}")
print(f"✓ File size: {joblib_file.stat().st_size / 1024:.2f} KB")

In [None]:
# Compare with pickle for the same model
pickle_rf_file = models_dir / 'random_forest.pkl'

with open(pickle_rf_file, 'wb') as f:
    pickle.dump(models['random_forest'], f)

# Compare file sizes
joblib_size = joblib_file.stat().st_size
pickle_size = pickle_rf_file.stat().st_size

print("File Size Comparison (Random Forest):")
print("="*60)
print(f"Joblib (compressed): {joblib_size / 1024:.2f} KB")
print(f"Pickle (uncompressed): {pickle_size / 1024:.2f} KB")
print(f"Savings: {(1 - joblib_size/pickle_size)*100:.1f}%")
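
Keep in mind that this comparison pits a compressed joblib file against an uncompressed pickle file, so most of the difference is likely due to compression rather than the format itself. The sketch below illustrates the point with the standard library only; the `payload` dictionary is a synthetic stand-in for a tree ensemble, not the Random Forest trained above.

```python
import gzip
import pickle

# Stand-in for a tree ensemble: long arrays of split thresholds and feature
# indices. The values are synthetic, not taken from the trained model.
payload = {
    "thresholds": [round(i * 0.01, 2) for i in range(50_000)],
    "features": [i % 30 for i in range(50_000)],
}

raw = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)
compressed = gzip.compress(raw, compresslevel=3)

print(f"pickle:        {len(raw) / 1024:.1f} KB")
print(f"pickle + gzip: {len(compressed) / 1024:.1f} KB")

# Round trip: decompress first, then unpickle.
restored = pickle.loads(gzip.decompress(compressed))
```

For a like-for-like comparison on the real model, you could also save it with `joblib.dump(..., compress=0)` and compare that file against the plain pickle.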

In [None]:
# Load model using joblib
start_time = time.perf_counter()
loaded_rf = joblib.load(joblib_file)
load_time = time.perf_counter() - start_time

# Verify model
y_pred = loaded_rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"✓ Model loaded successfully")
print(f"✓ Load time: {load_time*1000:.2f} ms")
print(f"✓ Accuracy: {accuracy:.4f}")
print(f"✓ Number of estimators: {len(loaded_rf.estimators_)}")

## 6. Comparing Loading Times

Let's benchmark different serialization methods.

In [None]:
# Benchmark loading times
def benchmark_loading(file_path, load_function, n_iterations=10):
    """
    Measure average loading time for a serialized model.
    """
    times = []
    
    for _ in range(n_iterations):
        start = time.perf_counter()
        _ = load_function(file_path)
        times.append(time.perf_counter() - start)
    
    return np.mean(times), np.std(times)

# Benchmark results
results = []

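# Pickle loading (the helper closes the file handle after each load)
def load_pickle(path):
    with open(path, 'rb') as f:
        return pickle.load(f)

pickle_mean, pickle_std = benchmark_loading(
    pickle_rf_file,
    load_pickle
)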
results.append({
    'Method': 'Pickle',
    'Mean Time (ms)': pickle_mean * 1000,
    'Std (ms)': pickle_std * 1000,
    'File Size (KB)': pickle_rf_file.stat().st_size / 1024
})

# Joblib loading
joblib_mean, joblib_std = benchmark_loading(
    joblib_file,
    joblib.load
)
results.append({
    'Method': 'Joblib',
    'Mean Time (ms)': joblib_mean * 1000,
    'Std (ms)': joblib_std * 1000,
    'File Size (KB)': joblib_file.stat().st_size / 1024
})

# Display results
results_df = pd.DataFrame(results)
print("Loading Time Benchmark (Random Forest):")
print("="*80)
print(results_df.to_string(index=False))

# Determine winner
fastest = results_df.loc[results_df['Mean Time (ms)'].idxmin()]
print(f"\n✓ Fastest method: {fastest['Method']} ({fastest['Mean Time (ms)']:.2f} ms)")

In [None]:
# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Loading time
ax1 = axes[0]
bars1 = ax1.bar(results_df['Method'], results_df['Mean Time (ms)'], 
                color=['steelblue', 'seagreen'], alpha=0.7)
ax1.errorbar(results_df['Method'], results_df['Mean Time (ms)'],
             yerr=results_df['Std (ms)'], fmt='none', color='black', capsize=5)
ax1.set_ylabel('Loading Time (ms)', fontweight='bold')
ax1.set_title('Model Loading Time Comparison', fontweight='bold')
ax1.grid(axis='y', alpha=0.3)

# Add value labels
for bar in bars1:
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2., height,
             f'{height:.2f}',
             ha='center', va='bottom')

# Plot 2: File size
ax2 = axes[1]
bars2 = ax2.bar(results_df['Method'], results_df['File Size (KB)'],
                color=['steelblue', 'seagreen'], alpha=0.7)
ax2.set_ylabel('File Size (KB)', fontweight='bold')
ax2.set_title('Serialized Model File Size', fontweight='bold')
ax2.grid(axis='y', alpha=0.3)

# Add value labels
for bar in bars2:
    height = bar.get_height()
    ax2.text(bar.get_x() + bar.get_width()/2., height,
             f'{height:.1f}',
             ha='center', va='bottom')

plt.tight_layout()
plt.show()

print("\n✓ Compressed joblib files are smaller, but decompression can make them slower to load than plain pickle. Check the numbers above.")

## 7. ONNX: Cross-Platform Model Format

**ONNX** (Open Neural Network Exchange) enables cross-platform and cross-language deployment.

### When to use ONNX:
- ✅ Deploy to non-Python environments (C++, Java, JavaScript)
- ✅ Mobile or edge deployment
- ✅ Performance optimization
- ✅ Framework interoperability

### Supported Models:
- Many scikit-learn models (via `skl2onnx`)
- PyTorch, TensorFlow, Keras (via their own exporters or converters)
- XGBoost, LightGBM (via `onnxmltools`)

In [None]:
# Install ONNX libraries if needed
# !pip install skl2onnx onnxruntime

try:
    from skl2onnx import convert_sklearn
    from skl2onnx.common.data_types import FloatTensorType
    import onnxruntime as rt
    onnx_available = True
    print("✓ ONNX libraries available")
except ImportError:
    onnx_available = False
    print("⚠ ONNX libraries not installed")
    print("  To use ONNX, install: pip install skl2onnx onnxruntime")

In [None]:
if onnx_available:
    # Convert scikit-learn model to ONNX
    onnx_file = models_dir / 'random_forest.onnx'
    
    # Define input type (crucial for ONNX conversion)
    initial_type = [('float_input', FloatTensorType([None, X_train.shape[1]]))]
    
    # Convert model
    onnx_model = convert_sklearn(
        models['random_forest'],
        initial_types=initial_type,
        target_opset=12
    )
    
    # Save ONNX model
    with open(onnx_file, 'wb') as f:
        f.write(onnx_model.SerializeToString())
    
    print(f"✓ Model converted to ONNX")
    print(f"✓ Saved to: {onnx_file}")
    print(f"✓ File size: {onnx_file.stat().st_size / 1024:.2f} KB")
else:
    print("Skipping ONNX conversion (libraries not installed)")

In [None]:
if onnx_available:
    # Load and run ONNX model
    sess = rt.InferenceSession(str(onnx_file))
    
    # Get input name
    input_name = sess.get_inputs()[0].name
    
    # Make predictions
    # ONNX requires float32 input
    X_test_float32 = X_test.astype(np.float32)
    
    start_time = time.perf_counter()
    onnx_pred = sess.run(None, {input_name: X_test_float32})
    onnx_time = time.perf_counter() - start_time
    
    # sess.run returns a list of outputs; for a converted classifier the first
    # is the predicted labels and the second holds the class probabilities
    y_pred_onnx = onnx_pred[0]
    
    accuracy = accuracy_score(y_test, y_pred_onnx)
    
    print(f"✓ ONNX model loaded and executed")
    print(f"✓ Inference time: {onnx_time*1000:.2f} ms")
    print(f"✓ Accuracy: {accuracy:.4f}")
    
    # Compare with original model
    start_time = time.perf_counter()
    y_pred_sklearn = models['random_forest'].predict(X_test)
    sklearn_time = time.perf_counter() - start_time
    
    print(f"\nComparison:")
    print(f"  Scikit-learn inference: {sklearn_time*1000:.2f} ms")
    print(f"  ONNX inference: {onnx_time*1000:.2f} ms")
    print(f"  Speedup: {sklearn_time/onnx_time:.2f}x")
else:
    print("Skipping ONNX inference (libraries not installed)")
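
The ONNX cell casts the inputs to float32, which is worth understanding before you compare predictions between runtimes. The sketch below is a generic illustration of how float32 rounding can flip a threshold comparison; the threshold and feature value are hypothetical, and whether any real prediction changes depends on the model and the converter. `to_float32` emulates with the standard library what `np.float32(value)` does.

```python
import struct


def to_float32(value):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]


threshold = 0.1          # hypothetical split threshold
x = 0.1 + 1e-10          # hypothetical feature value just above it

# In float64 the two values are distinct, so x is above the threshold.
goes_left_64 = x <= threshold

# In float32 both round to the same number, so the comparison flips.
goes_left_32 = to_float32(x) <= to_float32(threshold)

print(f"float64 comparison: x <= threshold is {goes_left_64}")
print(f"float32 comparison: x <= threshold is {goes_left_32}")
```

In practice this means a converted model should be checked against the original on held-out data, for example by measuring how often the two sets of predicted labels agree, rather than assumed to be identical.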

## 8. Model Metadata and Versioning

Always save metadata alongside your models for reproducibility.

In [None]:
import json
from datetime import datetime

import sklearn

# Create comprehensive metadata
model_metadata = {
    'model_name': 'random_forest_classifier',
    'model_type': 'RandomForestClassifier',
    'version': '1.0.0',
    'created_date': datetime.now().isoformat(),
    'framework': 'scikit-learn',
    # Record the version actually installed instead of a hard-coded string
    'framework_version': sklearn.__version__,
    'parameters': {
        'n_estimators': 100,
        'max_depth': 10,
        'random_state': 42
    },
    'training_data': {
        'n_samples': len(X_train),
        'n_features': X_train.shape[1],
        'n_classes': 2
    },
    'performance': {
        'accuracy': accuracy_score(y_test, models['random_forest'].predict(X_test)),
        'test_size': len(X_test)
    },
    'serialization': {
        'format': 'joblib',
        'compression': 3,
        'file_size_kb': joblib_file.stat().st_size / 1024
    },
    'author': 'MLOps Team',
    'description': 'Random Forest model for binary classification'
}

# Save metadata
metadata_file = models_dir / 'random_forest_metadata.json'
with open(metadata_file, 'w') as f:
    json.dump(model_metadata, f, indent=2)

print("✓ Model metadata saved")
print(f"✓ Location: {metadata_file}")
print("\nMetadata content:")
print(json.dumps(model_metadata, indent=2))
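
Pickle-based files are sensitive to the environment that created them, so it can help to record that environment in the metadata as well. The following is a minimal sketch using only the standard library; the `package_version` helper and the package list are assumptions, so adjust them to whatever your model actually depends on.

```python
import platform
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


# The package list is an assumption; record whatever your model depends on.
environment = {
    "python": platform.python_version(),
    "platform": platform.platform(),
    "packages": {
        name: package_version(name)
        for name in ("scikit-learn", "joblib", "numpy")
    },
}
print(environment)
```

A dictionary like this could be stored under an extra key in the metadata file and compared against the loading environment before the model is deserialized.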

In [None]:
# Function to load model with metadata validation
def load_model_with_validation(model_path, metadata_path):
    """
    Load a model and validate against its metadata.
    """
    # Load metadata
    with open(metadata_path, 'r') as f:
        metadata = json.load(f)
    
    # Load model
    model = joblib.load(model_path)
    
    # Validate model type
    expected_type = metadata['model_type']
    actual_type = type(model).__name__
    
    if actual_type != expected_type:
        raise ValueError(
            f"Model type mismatch! Expected {expected_type}, got {actual_type}"
        )
    
    print(f"✓ Model loaded and validated")
    print(f"  Name: {metadata['model_name']}")
    print(f"  Version: {metadata['version']}")
    print(f"  Created: {metadata['created_date']}")
    print(f"  Performance: {metadata['performance']}")
    
    return model, metadata

# Test the function
loaded_model, metadata = load_model_with_validation(joblib_file, metadata_file)
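
The type check above runs only after the file has been deserialized. One way to validate a file before loading it is to store a checksum in the metadata and compare it first. The sketch below assumes SHA-256 and uses a small pickled dictionary as a stand-in model; `sha256_of` and `verify_checksum` are hypothetical helper names.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large models need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected):
    """Raise ValueError if the file's hash differs from the recorded one."""
    actual = sha256_of(path)
    if actual != expected:
        raise ValueError(f"Checksum mismatch for {path}: {actual} != {expected}")


with tempfile.TemporaryDirectory() as tmp:
    model_file = Path(tmp) / "model.pkl"
    model_file.write_bytes(pickle.dumps({"weights": [0.1, 0.2]}))  # stand-in model

    recorded = sha256_of(model_file)        # store this in the metadata JSON
    verify_checksum(model_file, recorded)   # passes: file is unchanged

    model_file.write_bytes(b"tampered")     # simulate corruption
    try:
        verify_checksum(model_file, recorded)
        tamper_detected = False
    except ValueError:
        tamper_detected = True

print(f"Tampering detected: {tamper_detected}")
```

A checksum kept next to the model file mainly catches accidental corruption. To guard against deliberate tampering, the recorded hash has to come from a location the attacker cannot modify.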

## 9. Best Practices for Model Serialization

In [None]:
# Best Practice 1: Version your models
def save_versioned_model(model, model_name, version, base_dir='saved_models'):
    """
    Save model with version number in filename.
    """
    base_path = Path(base_dir) / model_name
    base_path.mkdir(parents=True, exist_ok=True)
    
    # Create versioned filename
    filename = f"{model_name}_v{version}.joblib"
    filepath = base_path / filename
    
    # Save model
    joblib.dump(model, filepath, compress=3)
    
    # Save metadata
    metadata = {
        'model_name': model_name,
        'version': version,
        'saved_date': datetime.now().isoformat(),
        'file_path': str(filepath)
    }
    
    metadata_path = base_path / f"{model_name}_v{version}_metadata.json"
    with open(metadata_path, 'w') as f:
        json.dump(metadata, f, indent=2)
    
    print(f"✓ Saved {model_name} version {version}")
    print(f"  Model: {filepath}")
    print(f"  Metadata: {metadata_path}")
    
    return filepath, metadata_path

# Example usage
model_path, meta_path = save_versioned_model(
    models['random_forest'],
    'fraud_detector',
    '1.2.0'
)
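
One detail to watch with version numbers in file names: sorting the names as strings does not order them by version. The sketch below assumes the `<name>_v<major>.<minor>.<patch>.joblib` naming scheme used above; the file names and the `parse_version` helper are hypothetical.

```python
import re
from pathlib import Path

VERSION_PATTERN = re.compile(r"_v(\d+)\.(\d+)\.(\d+)\.joblib$")


def parse_version(filename):
    """Extract (major, minor, patch) from names like 'model_v1.2.0.joblib'."""
    match = VERSION_PATTERN.search(Path(filename).name)
    if match is None:
        raise ValueError(f"No version found in {filename!r}")
    return tuple(int(part) for part in match.groups())


# Hypothetical file names following the naming scheme used above.
files = [
    "fraud_detector_v1.10.0.joblib",
    "fraud_detector_v1.2.0.joblib",
    "fraud_detector_v1.9.3.joblib",
]

latest_by_name = max(files)                        # string comparison
latest_by_version = max(files, key=parse_version)  # numeric comparison

print(f"Latest by name:    {latest_by_name}")
print(f"Latest by version: {latest_by_version}")
```

String comparison picks `1.9.3` because the character `9` sorts after `1`, while comparing integer tuples correctly picks `1.10.0`.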

In [None]:
# Best Practice 2: Include data preprocessing in serialization
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Create a pipeline that includes preprocessing
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=50, random_state=42))
])

# Train pipeline
pipeline.fit(X_train, y_train)

# Save entire pipeline
pipeline_file = models_dir / 'complete_pipeline.joblib'
joblib.dump(pipeline, pipeline_file)

print("✓ Complete pipeline saved (preprocessing + model)")
print(f"✓ File: {pipeline_file}")
print(f"\nPipeline steps:")
for name, step in pipeline.steps:
    print(f"  - {name}: {type(step).__name__}")

In [None]:
# Load and use pipeline
loaded_pipeline = joblib.load(pipeline_file)

# Make predictions (preprocessing is automatic)
y_pred = loaded_pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"✓ Pipeline loaded and tested")
print(f"✓ Accuracy: {accuracy:.4f}")
print("\n✓ No need to manually apply preprocessing - it's built into the pipeline!")

## 10. Exercises

### Exercise 1: Serialization Format Comparison

Compare different serialization formats for various model types.

**Requirements:**
1. Train 3 different model types (e.g., Logistic Regression, SVM, Gradient Boosting)
2. Save each using both pickle and joblib
3. Compare:
   - File sizes
   - Loading times
   - Memory usage
4. Create a visualization showing the results

**Bonus**: Include compression levels in your comparison

In [None]:
# Your solution here

# TODO: Implement serialization comparison
# 1. Train multiple models
# 2. Save with different formats
# 3. Measure metrics
# 4. Visualize results

### Exercise 2: Create a Model Versioning System

Build a simple model versioning and management system.

**Requirements:**
1. Create functions to:
   - Save models with automatic version incrementing
   - List all saved model versions
   - Load a specific version
   - Compare performance across versions
2. Include metadata for each version (timestamp, parameters, performance)
3. Implement a function to rollback to a previous version

**Bonus**: Add a function to automatically save the best model based on a metric

In [None]:
# Your solution here

class ModelVersionManager:
    """Manage model versions with automatic serialization."""
    
    def __init__(self, base_dir='model_versions'):
        self.base_dir = Path(base_dir)
        self.base_dir.mkdir(parents=True, exist_ok=True)
    
    def save_model(self, model, model_name, metrics=None, metadata=None):
        """Save model with automatic version increment."""
        # TODO: Implement
        pass
    
    def list_versions(self, model_name):
        """List all versions of a model."""
        # TODO: Implement
        pass
    
    def load_model(self, model_name, version=None):
        """Load a specific version (latest if not specified)."""
        # TODO: Implement
        pass
    
    def compare_versions(self, model_name):
        """Compare all versions of a model."""
        # TODO: Implement
        pass

# Test your implementation

### Exercise 3: Cross-Platform Deployment

Prepare a model for deployment in a different environment.

**Requirements:**
1. Train a model and save it in ONNX format
2. Create a simple inference script that:
   - Loads the ONNX model
   - Accepts input data
   - Returns predictions
   - Handles errors gracefully
3. Document the model's input/output schema
4. Create a README explaining how to use the model in production

**Bonus**: Test inference speed and compare with the original scikit-learn model

In [None]:
# Your solution here

# TODO: Implement cross-platform deployment
# 1. Convert model to ONNX
# 2. Create inference script
# 3. Document schema
# 4. Benchmark performance

## 11. Summary

### Key Concepts Covered

1. **Serialization Formats**: Pickle, Joblib, ONNX, and their use cases
2. **Performance Comparison**: File size and loading time trade-offs
3. **Best Practices**: Versioning, metadata, and validation
4. **Pipeline Serialization**: Saving complete preprocessing + model pipelines
5. **Cross-Platform Deployment**: Using ONNX for language-agnostic deployment

### Decision Guide: Which Format to Use?

```
Choose Pickle when:
  ✓ Simple, small models
  ✓ Python-only deployment
  ✓ Quick prototyping

Choose Joblib when:
  ✓ Scikit-learn models
  ✓ Large models with NumPy arrays
  ✓ Production Python deployment

Choose ONNX when:
  ✓ Cross-language deployment
  ✓ Mobile/edge deployment
  ✓ Performance optimization needed
  ✓ Framework interoperability
```

### Best Practices Summary

- ✅ **Always version your models** with semantic versioning
- ✅ **Save metadata** alongside models (parameters, metrics, dates)
- ✅ **Use pipelines** to include preprocessing in serialization
- ✅ **Validate on load** to catch deserialization errors early
- ✅ **Document input/output schemas** for production deployment
- ✅ **Test deserialization** in target environment before deploying
- ✅ **Use compression** for large models (joblib compress parameter)

### Common Pitfalls to Avoid

- ❌ Loading pickle or joblib files from untrusted sources (unpickling can run arbitrary code)
- ❌ Not versioning serialized models
- ❌ Forgetting to save preprocessing steps
- ❌ Not testing cross-platform compatibility
- ❌ Hardcoding file paths
- ❌ Not validating model integrity after loading

### What's Next

In **Module 04: Creating ML APIs with FastAPI**, we'll learn:
- Building REST APIs for model serving
- Request validation and error handling
- API documentation with OpenAPI
- Authentication and rate limiting

### Additional Resources

- **ONNX Documentation**: https://onnx.ai/
- **Joblib Documentation**: https://joblib.readthedocs.io/
- **Model Serialization Best Practices**: https://neptune.ai/blog/how-to-save-and-load-ml-models

---

## Next Steps

Proceed to **Module 04: Creating ML APIs with FastAPI** to learn how to serve your serialized models via REST APIs.

**Before moving on, ensure you can:**
- ✅ Save and load models using pickle and joblib
- ✅ Choose the appropriate serialization format for your use case
- ✅ Convert models to ONNX for cross-platform deployment
- ✅ Include metadata with serialized models
- ✅ Serialize complete pipelines (preprocessing + model)
- ✅ Implement model versioning